Checkpointing and rollback recovery are techniques that can provide efficient recovery
from transient process failures. In a message-passing system, the rollback of a message
sender may cause the rollback of the corresponding receiver, and the system needs to 
roll back to a consistent set of checkpoints called the recovery line. If the processes are 
allowed to take uncoordinated checkpoints, the above rollback propagation may result in 
the domino effect which prevents recovery line progression. Traditionally, only obsolete 
checkpoints before the global recovery line can be discarded, and the necessary and 
sufficient condition for identifying all garbage checkpoints has remained an open problem. 

In this thesis, we derive a necessary and sufficient condition for achieving optimal 
garbage collection, and we prove that the number of useful checkpoints is in fact bounded 
by N(N + 1)/2, where N is the number of processes. Our approach is based on the
maximum-sized antichain model of consistent global checkpoints and the technique of 
recovery line transformation and decomposition. We also show that, for systems requiring 
message logging to record in-transit messages, the same approach can be used to achieve 
optimal message log reclamation. As a final topic, we describe a unifying framework by 
considering checkpoint coordination and exploiting piecewise determinism as mechanisms
for bounding rollback propagation, and demonstrate the applicability of the optimal
garbage collection algorithm to domino-free recovery protocols.

1. INTRODUCTION


1.1 Checkpointing and Rollback Recovery 

Checkpointing and rollback recovery provide for recovery from transient process fail- 
ures. During normal execution, the state of each process is periodically saved on stable 
storage as a checkpoint. When a failure occurs, the process can roll back to a previ- 
ous checkpoint by reloading the checkpointed state to avoid costly reexecution from the 
very beginning. In a message-passing system, rollback propagation can occur when the 
rollback of a message sender results in the rollback of the corresponding receiver. The 
system is then required to roll back to the latest available consistent set of checkpoints 
called the recovery line to ensure correct recovery with a minimum amount of rollback. 
In the worst case, cascading rollback propagation [1] may result in the domino effect [2, 3] 
which prevents recovery line progression. 

Numerous checkpointing and recovery techniques for message-passing systems have 
been proposed in the literature. They can be classified into three primary categories: 
uncoordinated checkpointing, coordinated checkpointing and the log-based approach. 
Uncoordinated checkpointing [4-6] allows each process to take its checkpoints inde- 
pendently, without coordinating with any other processes. It allows maximum process 
autonomy and general nondeterministic executions, but suffers from potential domino
effects and the large space overhead for maintaining multiple checkpoints of each process.
Processes are allowed to take uncoordinated checkpoints, and the dependencies
among the checkpoints caused by message communication are recorded through dependency
tracking. The recovery line is unknown during normal execution and is computed
at the time of recovery based on the dependency information. Rollback propagation can 
be eliminated by taking a checkpoint immediately after sending every message [7], and 
domino-free recovery can be achieved by inserting a checkpoint before processing any 
message carrying a new dependency [8,9], or by inserting a checkpoint between every 
pair of consecutive send and receive events (in that order) [1]. 

Coordinated checkpointing eliminates the domino effect by sacrificing a certain 
degree of process autonomy and incurring run-time and message overhead. Usually,
whenever a process takes a checkpoint, it broadcasts a coordination message to force all 
of the other processes to take appropriate checkpoints to guarantee that the resulting set 
of checkpoints is consistent [10-18]. The number of processes required to participate in 
each checkpointing session can be reduced by monitoring the recent message exchanges 
[19]. The extra message overhead can be avoided by piggybacking the coordination 
messages on subsequent normal messages [20-22], or by taking advantage of the existing 
clock synchronization mechanisms [23-25]. 

The log-based approach assumes the piecewise deterministic execution model [26],
which views process execution as consisting of a number of deterministic state intervals,
each started by a nondeterministic event such as processing a new message. Nondeterministic
event logging, in addition to checkpointing, is employed to reduce rollback
propagation through deterministic state reconstruction. Synchronous message logging 
protocols [27-29] log each message upon receipt. Since the process state from which 
any message is sent can always be reconstructed through message replaying, rollback
propagation is completely eliminated. Asynchronous message logging protocols [26, 30- 
41] reduce logging overhead by grouping several messages in a single write operation to 
stable storage. Although rollback propagation may occur when not-yet-logged messages
are lost upon a failure, recovery line progression is guaranteed as long as every message 
is eventually logged [26,33]. 

The main focus of this thesis is on uncoordinated checkpointing and, in particular, the 
garbage collection procedure for reclaiming the storage space of those checkpoints that 
are no longer useful. Traditionally, garbage collection for uncoordinated checkpointing 
has been based on the notion of obsolete checkpoints: the global recovery line which
suffices to recover from the failure of the entire system is computed; then all of the 
obsolete checkpoints before that recovery line are no longer useful and can be discarded. 
In contrast, all of the nonobsolete checkpoints have been assumed to be possibly useful 
for some future recovery and should be retained. With the possibility of domino effects, 
the number of nonobsolete checkpoints is potentially unbounded. 

Motivated by the observation that being obsolete is simply a sufficient condition for 
being garbage, we derive a necessary and sufficient condition for identifying all garbage
checkpoints, which leads to an optimal garbage collection algorithm and the lowest upper
bound on the number of nongarbage checkpoints. Our approach is to model consistent 
global checkpoints as maximum-sized antichains of the partially ordered set generated 
by the happened before relation between the checkpoints. We define a recovery line 
transformation and decomposition, and we demonstrate that any nongarbage checkpoint 
belonging to a possible future recovery line must also be contained in one of the N
“immediate future” recovery lines, where N is the number of processes. It is also shown that
these N recovery lines can contain at most N(N + 1)/2 distinct nongarbage checkpoints.

Usually, the in-transit messages, i.e., messages “sent but not yet received” with respect
to a set of checkpoints, are assumed to be handled by a reliable transmission
protocol and do not result in checkpoint inconsistency. We point out that to support 
the above assumption, the acknowledgement message for every normal message has to 
be considered as an additional dependency-carrying message which would result in extra
rollback propagation. An alternative way of retrieving the in-transit messages is to use
message logging. The message logs then constitute another source of space overhead. We
demonstrate that the same approach based on recovery line transformation and decomposition
can be used to develop an optimal message log reclamation algorithm. More
specifically, we show that any message that can possibly become an in-transit message in 
the future must also be an in-transit message with respect to one of the N “immediate 
future” recovery lines.

Our optimal garbage collection algorithm addresses the space overhead issue of uncoordinated
checkpointing, but the possibility of domino effects still remains. In Chapter 3,
we extend the applicability of the algorithm to a domino-free unifying framework. Tra- 
ditionally, uncoordinated checkpointing, coordinated checkpointing, and the log-based 
approach have been considered three separate approaches, each with its own advantages 
and disadvantages. The unifying framework provides a different point of view by consid- 
ering uncoordinated checkpointing as the basic and the most general scheme because it 
does not require process execution to satisfy the piecewise deterministic model. Check- 
point coordination and message logging for exploiting piecewise determinism are then 
considered two mechanisms for bounding rollback propagation. We propose a lazy check- 
point coordination technique [22] to allow sacrificing a varying degree of process autonomy 
in exchange for a guarantee of recovery line progression. Message logging whenever piece- 
wise determinism is available is interpreted as placing additional logical checkpoints [42] 
at the end of the state intervals, thereby reducing the rollback distances and hence the 
possibility of rollback propagation. The unifying framework and the optimal garbage collection
algorithm together then provide a flexible, effective, economic way of recovering
from transient process failures. 

1.2 Checkpoint Consistency 

The system considered in this thesis consists of a number of concurrent processes for 
which all process communication is through message passing. Processes are assumed to
run on fail-stop processors [43], i.e., no corrupted messages can be sent. All processes
running on the same recovery unit [26] will be rolled back together in response to a failure. 
For the purpose of presentation, we consider each process to be an individual recovery 
unit. To allow general nondeterministic execution, we do not assume the piecewise deterministic
model. This implies that whenever the sender of a message m rolls back to
a point before m was sent to unsend m, the corresponding receiver must also roll back
to a point before m was processed in order to unprocess m.¹ Let $c_{i,x}$ denote the xth
checkpoint ($x \ge 0$) of process $p_i$. Figure 1.1(a) gives such an example. Suppose process
$p_j$ rolls back to $c_{j,y}$. Due to the potential nondeterminism preceding the sending of
m, $p_j$ cannot guarantee the regeneration of an exact copy of m during its reexecution
(even under the fail-stop assumption). Thus, $p_i$’s execution based on the processing of
m is no longer valid and $p_i$ should also roll back to nullify the effect of m. The message
m which is unsent by $p_j$ is called an orphan message and results in the inconsistency
between $c_{j,y}$ and $c_{i,x+1}$. The two checkpoints thus cannot be used together for recovery.

¹ We say a message is received by the destination processor and then later processed by the destination
process. A message results in dependency only after it is processed.

Figure 1.1: Checkpoint consistency (solid line for message processing; dashed line for
message receipt). (a) Orphan message m and inconsistent checkpoints $c_{i,x+1}$
and $c_{j,y}$; (b) in-transit message m' and consistent checkpoints $c_{i,x}$ and $c_{j,y}$.

In contrast, Fig. 1.1(b) shows another situation in which message m' is recorded as
“sent but not yet received” and hence is called an in-transit message with respect to the
two checkpoints $c_{i,x}$ and $c_{j,y}$. Suppose that process $p_i$ rolls back to $c_{i,x}$ and unreceives
m'. A straightforward way of handling such a situation is to also roll back $p_j$ to unsend
m', a mechanism we call in-transit message invalidation. However, such invalidation can
result in excessive rollback propagation and a higher probability of domino effects. Another
commonly used mechanism can be called in-transit message retrieval. If during
$p_i$’s reexecution from $c_{i,x}$, message m' can be retrieved from a message log or through
an end-to-end transmission protocol, then $p_i$ need not request $p_j$ to unsend m'. Several
approaches employing the above two mechanisms to handle in-transit messages are
summarized in the following. The first and the fourth approaches use a combination of
invalidation and retrieval; the other two approaches are based completely on the retrieval
mechanism.

Approach 1: reliable end-to-end transmission protocol 

Koo and Toueg [19] argued that the situation of message m' with respect to the
checkpoints $c_{i,x}$ and $c_{j,y}$ in Fig. 1.2(a) is indistinguishable from the situation in which
m' is lost in the communication channel during normal execution. Therefore, a reliable
end-to-end transmission protocol, which can guarantee the retransmission of any lost
message until it is received by the destination processor, will also be able to retransmit
m' after the two processes roll back to $c_{i,x}$ and $c_{j,y}$.

Figure 1.2: In-transit message and checkpoint consistency. (a) In-transit message m'; (b)
consistent checkpoints $c_{i,x}$ and $c_{j,y}$; (c) inconsistent checkpoints $c_{i,x}$ and $c_{j,y}$,
due to message ack.

However, this is true only for the situation shown in Fig. 1.2(b) where $c_{j,y}$ is taken
before $p_j$ receives message ack (the acknowledge message for m'), and thus will record
a copy of message m' as well as the responsibility of retransmitting m' until ack comes
back. If checkpoint $c_{j,y}$ is taken after $p_j$ receives ack, as shown in Fig. 1.2(c), $c_{j,y}$ will
simply lose the capability of resending m'. It becomes clear that, to distinguish the two
different scenarios, message ack has to be treated as an additional dependency-carrying
message which can result in extra rollback propagation. More precisely, in Fig. 1.2(b),
the rollback of $p_i$ to $c_{i,x}$ requires (through ack) that $p_j$ be rolled back to $c_{j,y}$ so that
the in-transit message m' can be resent. In Fig. 1.2(c), the rollback of $p_i$ to $c_{i,x}$ causes
(through ack) the rollback of $p_j$ to a checkpoint earlier than $c_{j,y}$ in order to invalidate m'.
We note that the inconsistency between $c_{i,x}$ and $c_{j,y}$ in Fig. 1.2(c) is due to the orphan
message ack, not the in-transit message m'.

Approach 2: synchronous message logging 

Treating every acknowledge message as a dependency-carrying message potentially 
doubles the amount of rollback propagation and makes recovery line progression more 
difficult. Another way to handle in-transit messages is to use message logging. A syn- 
chronous message logging protocol logs every incoming message m' upon its arrival and 
delays the sending of the ack message until m' is logged. In this way, if the receiver $p_i$ rolls
back and unreceives m' before it logs m', then the sender will resend m' because the
corresponding ack is never generated; if $p_i$ initiates the rollback after it logs m', then $p_i$
can retrieve m' from the log during its reexecution.
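
The log-before-acknowledge rule of Approach 2 can be illustrated with a minimal Python sketch. The class and method names (`Sender`, `Receiver`, `on_arrival`, `to_resend`) and the use of an in-memory list as a stand-in for stable storage are assumptions made for illustration only; they are not part of any protocol cited above.

```python
class Receiver:
    """Receiver side of a synchronous logging sketch (names are hypothetical)."""

    def __init__(self):
        self.stable_log = []  # stands in for stable storage

    def on_arrival(self, msg):
        # Log the message first; only then generate the acknowledgement.
        self.stable_log.append(msg)
        return ("ack", msg["id"])

    def replay(self):
        # After a rollback, logged messages are retrieved from the log.
        return list(self.stable_log)


class Sender:
    """Sender side: keeps every message until its ack arrives."""

    def __init__(self):
        self.unacked = {}

    def send(self, msg_id, payload):
        msg = {"id": msg_id, "payload": payload}
        self.unacked[msg_id] = msg
        return msg

    def on_ack(self, ack):
        self.unacked.pop(ack[1], None)

    def to_resend(self):
        # Messages whose ack was never generated must be retransmitted.
        return list(self.unacked.values())
```

In this sketch a message is always available from exactly one of the two sides: from `to_resend()` if the receiver failed before logging it, or from `replay()` once it has been logged and acknowledged.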

Approach 3: asynchronous message logging with sender logging 

A synchronous logging protocol logs each incoming message separately and can result 
in large performance degradation. Asynchronous logging protocols [26] reduce run-time 
overhead by grouping several messages and logging them later in a single write operation. 
Additional sender logging can be used to maintain a copy of each message which is not yet 
logged by the receiver and can potentially be lost in the presence of a receiver’s failure. 
Every process keeps the messages it has sent in a volatile log [26] and writes them to 
stable storage at the next checkpoint. The sender retains the log for each message until 
it is notified (by a log_ack message) that the message has been logged by the receiver. In
this way, every in-transit message can always be retrieved from either the sender’s log or
the receiver’s log. Figure 1.3(a) and (b) illustrate the difference between Approaches 2
and 3 by showing the availability of the message log for m' at different times, as indicated
by the shaded bars.

Figure 1.3: Message logging for recording the in-transit messages. (a) Synchronous message
logging; (b) asynchronous message logging with sender logging; (c) asynchronous
message logging without sender logging. (Shaded bars indicate the
availability of the message log for m'.)


Approach 4: asynchronous message logging without sender logging 

Without additional sender logging, messages can be lost upon a receiver’s failure in 
an asynchronous logging protocol. One way to remedy such a situation is to compare 
the set of messages sent with the set of messages logged and consider unavailable those
checkpoints before which there is any “sent but not yet logged” message [6], as illustrated
in Fig. 1.3(c). This procedure effectively invalidates all lost messages and ensures that
in-transit messages with respect to the computed recovery line can all be retrieved from 
the receiver’s log. 

All of the above four approaches can guarantee that messages like m' in Fig. 1.1(b) 
do not cause the inconsistency between $c_{i,x}$ and $c_{j,y}$. Therefore, the situation shown in
Fig. 1.1(a) is the only source of checkpoint inconsistency. 

1.3 Uncoordinated Checkpointing Protocol 

Having addressed the checkpoint consistency issues, we now describe an uncoordinated
checkpointing protocol. Suppose there are N processes in the system. During
normal execution, each process takes its local checkpoints periodically without coordinating
with any other processes. Let (i, x) denote the xth checkpoint interval of process
$p_i$ between consecutive checkpoints $c_{i,x}$ and $c_{i,x+1}$, as shown in Fig. 1.1(a). Each message
is tagged with the current checkpoint interval number and the process number of
the sender. Each receiver $p_i$ performs direct dependency tracking [4, 44] as follows: if a
message sent from (j, y) is processed by $p_i$ in (i, x), then the direct dependency of $c_{i,x+1}$
on $c_{j,y}$ is recorded.
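
The tagging and recording rule above can be sketched in a few lines of Python. The class `Process` and its methods are hypothetical names introduced for illustration; a checkpoint $c_{i,x}$ is represented by the pair `(i, x)`, and stable storage is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    interval: int = 0  # current checkpoint interval x; (pid, 0) is the initial checkpoint
    deps: set = field(default_factory=set)  # recorded edges ((j, y), (i, x + 1))

    def take_checkpoint(self):
        """Taking checkpoint (i, x + 1) closes interval (i, x) and opens (i, x + 1)."""
        self.interval += 1
        return (self.pid, self.interval)

    def send(self, payload):
        """Tag each outgoing message with the sender's process and interval numbers."""
        return {"src": self.pid, "interval": self.interval, "payload": payload}

    def process_message(self, msg):
        """A message sent from (j, y) and processed in (i, x) makes (i, x + 1) depend on (j, y)."""
        edge = ((msg["src"], msg["interval"]), (self.pid, self.interval + 1))
        self.deps.add(edge)
        return edge
```

Each process only records dependencies on its own future checkpoints, which is why the tracking is called direct: transitive dependencies are recovered later from the checkpoint graph.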

A centralized garbage collection algorithm can be invoked by any process $p_i$ periodically
to reclaim the storage space of garbage checkpoints and possibly message logs (if
the in-transit messages are recorded through message logging) that are no longer useful
for any future recovery. First, $p_i$ broadcasts a request message to collect the direct dependency
information from all other processes. A checkpoint graph [4] is constructed, in
which each vertex represents a checkpoint and each edge represents a direct dependency
(including the implicit dependency of any $c_{j,y+1}$ on $c_{j,y}$). Figure 1.4(b) shows the checkpoint
graph corresponding to the checkpoint and communication pattern in (a). The
rollback propagation algorithm listed in Fig. 1.5 is executed on the checkpoint graph to
determine the global recovery line,² which is then broadcast in a recovery_line message.
All checkpoints and message logs before the global recovery line are obsolete, and their
space can therefore be reclaimed. Note that processes other than the initiator do not
have to block their executions between replying to the request message and receiving the
recovery_line message.
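
A minimal Python sketch of the rollback propagation algorithm of Fig. 1.5 is given here for concreteness. The function names and the input encoding are assumptions: `latest[i]` is the index of the latest checkpoint of process i, a checkpoint is a pair `(i, x)`, and `deps` holds the direct dependency edges collected from the processes. The sketch also assumes that initial checkpoints have no incoming edges, so an unmarked checkpoint always exists on every process.

```python
def strictly_reachable(edges, roots):
    """Return all checkpoints reachable from `roots` by a path of length >= 1."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    seen, stack = set(), list(roots)
    while stack:
        u = stack.pop()
        for v in succ.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def recovery_line(latest, deps):
    """Compute the recovery line on the checkpoint graph built from `deps`."""
    edges = set(deps)
    # Add the implicit dependency of each checkpoint on its predecessor.
    for i, last in enumerate(latest):
        edges.update(((i, x), (i, x + 1)) for x in range(last))
    root = [(i, last) for i, last in enumerate(latest)]
    marked = strictly_reachable(edges, root)
    while any(cp in marked for cp in root):
        # Replace each marked root by the latest unmarked checkpoint of that process.
        root = [
            (i, max(x for x in range(latest[i] + 1) if (i, x) not in marked))
            if (i, cur) in marked else (i, cur)
            for i, cur in root
        ]
        marked |= strictly_reachable(edges, root)
    return root
```

On a hypothetical two-process domino-style pattern with edges `(1,0)->(0,1)`, `(0,1)->(1,1)`, `(1,1)->(0,2)` and `(0,2)->(1,2)`, the sketch rolls both processes back to their initial checkpoints, which is the behavior the domino effect describes.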

When a process $p_j$ initiates a rollback, it starts a similar two-phase procedure for
recovery, except for the following differences. The volatile states of surviving processes
remain valid and can be viewed as additional virtual checkpoints [5] for constructing
an extended checkpoint graph of which the recovery line is called a local recovery line.
Figure 1.4(c) shows an example in which $p_4$ initiates a rollback. Every other process is
blocked, after supplying $p_4$ with the dependency information, until it rolls back to the
checkpoint as indicated by the local recovery line. Figure 1.4(d) shows the checkpoint
graph immediately after the recovery.

² A global recovery line is to be used when the entire system fails, while a local recovery line is
computed when only a subset of processes becomes faulty.

Figure 1.4: Checkpoint graphs and recovery lines. (a) Checkpoint and communication
pattern; (b) checkpoint graph; (c) extended checkpoint graph; (d) checkpoint
graph after recovery.

/* CP represents a checkpoint */
/* Initially, all of the CPs are unmarked */

include the latest CP of each process in the root set;
mark all CPs strictly reachable from any CP in the root set;
while (at least one CP in the root set is marked) {
    replace each marked CP in the root set by the latest unmarked CP on the
    same process;
    mark all CPs strictly reachable from any CP in the root set;
}
the root set is the recovery line.

Figure 1.5: The rollback propagation algorithm.

1.4 A Model of Consistent Global Checkpoints


1.4.1 Partially ordered sets, antichains and lattices

A partial order [45] on a set $S$ is a relation “<” such that

(a) $\forall s \in S,\ s \not< s$. (Irreflexivity)

(b) If $s < t$, then $t \not< s$. (Antisymmetry)

(c) If $s < t$ and $t < u$, then $s < u$. (Transitivity)

The pair $(S, <)$ is called a partially ordered set, or poset. An element $s$ is minimal if there
does not exist any element $w$ such that $w < s$. An element $t$ of $S$ is a minimum element
if $t \le w$ for all $w$ in $S$, that is, $t < w$ or $t = w$. Maximal and maximum elements are
similarly defined.

A subset $H$ of $S$ is a chain of $(S, <)$ if the elements of $H$ can be enumerated as
$h_1, h_2, \ldots, h_n$ such that $h_1 < h_2 < \cdots < h_n$. A subset $A$ of $S$ is an antichain of $(S, <)$ if
$a \not< b$ for all $a, b \in A$. An antichain $M$ of $(S, <)$ with the largest size of any antichain is
called a maximum-sized antichain.

Given two elements $s$ and $t$ of a poset $(S, <)$, we write $s \le t$ if $s < t$ or $s = t$. Any
element $l$ such that $l \le s$ and $l \le t$ is called a lower bound of $s$ and $t$. If there exists
a lower bound $l^*$ such that $l \le l^*$ for all lower bounds $l$ of $s$ and $t$, then $l^*$ is called the
greatest lower bound of $s$ and $t$. Upper bound and least upper bound are similarly defined.
A lattice is a poset $(S, <)$ which possesses both a greatest lower bound (called the meet
and denoted by $s \wedge t$) and a least upper bound (called the join and denoted by $s \vee t$) for
all $s, t \in S$.

Let $P = (S, <)$, and let $\mathcal{M}(P)$ denote the set of maximum-sized antichains of $P$. A
partial order $\preceq$ on the maximum-sized antichains can be defined as follows [46]: for any
$M_1, M_2 \in \mathcal{M}(P)$,

$M_1 \preceq M_2$ iff for every $a_1 \in M_1$, there exists $a_2 \in M_2$ such that $a_1 \le a_2$. (1.1)

It has been shown [46, §13.1-13.2] that, for any poset $P$, $(\mathcal{M}(P), \preceq)$ forms a lattice
and therefore possesses a unique maximum element called the maximal maximum-sized
antichain and denoted by $M^*(P)$.
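
For small posets, the definitions above can be checked by brute force. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, the strict order is passed as a transitively closed set `less` of pairs `(a, b)` meaning a < b, and the enumeration is exponential in the number of elements.

```python
from itertools import combinations


def maximum_sized_antichains(elements, less):
    """Enumerate all antichains of the largest possible size."""
    for size in range(len(elements), 0, -1):
        found = [
            set(c) for c in combinations(elements, size)
            if all((a, b) not in less and (b, a) not in less
                   for a, b in combinations(c, 2))
        ]
        if found:
            return found
    return []


def precedes(m1, m2, less):
    """The order of Eq. (1.1): every a1 in m1 lies at or below some a2 in m2."""
    return all(any(a1 == a2 or (a1, a2) in less for a2 in m2) for a1 in m1)


def maximal_antichain(antichains, less):
    """Return the maximum-sized antichain that every other one precedes."""
    for m in antichains:
        if all(precedes(other, m, less) for other in antichains):
            return m
    return None
```

For example, on the four-element poset with a < c, a < d and b < d, the maximum-sized antichains are {a, b}, {b, c} and {c, d}, and the maximal one is {c, d}.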

1.4.2 Consistent global checkpoints and recovery lines 

The execution of each process in a message-passing system can be viewed as a sequence 
of events, corresponding to the state changes that take place in the process. The collection 
of event sequences for the participating processes forms the execution history of the 
system. The proper granularity at which to view “events” varies from application to 
application. For our purposes, the events of interest are the sending and receiving of
messages, and the recording of local checkpoints by individual processes. An execution
history restricted to these events will be called a checkpoint and communication pattern.
We assume that the first event in each process is an initial local checkpoint event.

The global set of events appearing in a checkpoint and communication pattern cannot
be placed naturally in a total order, as can the events of a single process. Instead, a partial
order on the events can be defined as follows. We say that event $e_1$ directly happened
before event $e_2$ [19,47], denoted by $e_1 <_d e_2$, if

1. $e_1$ and $e_2$ are events in the same process and $e_1$ occurs immediately before $e_2$; or

2. $e_1$ is the sending of a message m and $e_2$ is the receiving of m.

The transitive closure of the $<_d$ relation is the happened before relation < [47].
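
The construction of the happened before relation from the two rules above can be sketched as a reachability computation. In this Python sketch, the function name and the input encoding (one list of event labels per process in local order, plus a list of send/receive pairs) are assumptions chosen for illustration.

```python
def happened_before(process_events, messages):
    """Return the set of pairs (e1, e2) such that e1 happened before e2.

    process_events: list of per-process event lists, each in local order.
    messages: iterable of (send_event, receive_event) pairs.
    """
    direct = {}
    # Rule 1: consecutive events of the same process.
    for events in process_events:
        for e1, e2 in zip(events, events[1:]):
            direct.setdefault(e1, set()).add(e2)
    # Rule 2: the sending of a message precedes its receipt.
    for send, recv in messages:
        direct.setdefault(send, set()).add(recv)
    # Transitive closure by a depth-first search from every event.
    before = set()
    for events in process_events:
        for start in events:
            stack, seen = list(direct.get(start, ())), set()
            while stack:
                e = stack.pop()
                if e not in seen:
                    seen.add(e)
                    stack.extend(direct.get(e, ()))
            before.update((start, e) for e in seen)
    return before
```

Restricting the returned pairs to checkpoint events gives the order used in the remainder of this section to decide whether two checkpoints are consistent.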

Let $\mathcal{P}$ be a checkpoint and communication pattern. A global checkpoint of $\mathcal{P}$ is a
set of N local checkpoints, one from each process. Based on the previous description of
checkpoint consistency, two checkpoints are inconsistent if and only if they are ordered by
the happened before relation. For example, $c_{j,y}$ and $c_{i,x+1}$ in Fig. 1.1(a) are inconsistent
because $c_{j,y} < c_{i,x+1}$. A consistent global checkpoint of $\mathcal{P}$ is therefore a global checkpoint
of which no two constituent checkpoints are ordered by the happened before relation.
We will denote by $E_{\mathcal{P}}$ the set of events that appear in $\mathcal{P}$, and by $Q_{\mathcal{P}}$ the poset generated
by the happened before relation on those events: $Q_{\mathcal{P}} = (E_{\mathcal{P}}, <)$. Let $R_{\mathcal{P}} = (C_{\mathcal{P}}, <)$ be
the induced subposet [45] of $Q_{\mathcal{P}}$ obtained by restricting the < relation to $C_{\mathcal{P}}$, the set of
all checkpoints. In the remainder of this section we derive an important characterization
of consistent global checkpoints related to the poset $R_{\mathcal{P}}$.

LEMMA 1 The largest size of any antichain in $R_{\mathcal{P}}$ is N, and every antichain of size
N includes a checkpoint from each process in $\mathcal{P}$.

Proof. The initial checkpoints form an antichain of size N and hence the largest size
of any antichain in $R_{\mathcal{P}}$ is at least N. Because any two checkpoints from the same process
must be ordered by the happened before relation, an antichain contains at most one
checkpoint from each process. The largest size of any antichain is therefore
exactly N, and every antichain of size N must include a checkpoint from each process.
□

THEOREM 1 M is a consistent global checkpoint in $\mathcal{P}$ if and only if it is a maximum-sized
antichain in $R_{\mathcal{P}}$.

Proof. By definition, a consistent global checkpoint of $\mathcal{P}$ is clearly an antichain of
size N in $Q_{\mathcal{P}}$ and therefore in $R_{\mathcal{P}}$ as well (since $R_{\mathcal{P}}$ is an induced subposet of $Q_{\mathcal{P}}$). By
Lemma 1, it is a maximum-sized antichain in $R_{\mathcal{P}}$.

Conversely, if M is a maximum-sized antichain in $R_{\mathcal{P}}$, then by Lemma 1 it includes
a local checkpoint from each process in $\mathcal{P}$ and these local checkpoints are pairwise unordered
by <. Thus M is a consistent global checkpoint of $\mathcal{P}$. □

For a given antichain $A$ of $R_{\mathcal{P}}$, we let $A[i]$ denote the element of $A$ which is a checkpoint
of process $p_i$. The following lemma shows that for the poset $R_{\mathcal{P}}$, Anderson’s global $\preceq$
relation as defined in Eq. (1.1) reduces to local ordering of checkpoints within each
process.

LEMMA 2 For any $M_1, M_2 \in \mathcal{M}(R_{\mathcal{P}})$,

$M_1 \preceq M_2$ iff $M_1[i] \le M_2[i]$, for all $0 \le i \le N - 1$. (1.2)

Proof. Suppose $M_1 \preceq M_2$. For any given $i$ let $j$ be such that $M_1[i] \le M_2[j]$ in
$R_{\mathcal{P}}$. Since $M_1[i]$ and $M_2[i]$ are events in the same process, either $M_2[i] < M_1[i]$ or
$M_1[i] \le M_2[i]$. In the first case, we would have $M_2[i] < M_1[i] \le M_2[j]$, contradicting
the fact that $M_2$ is an antichain in $R_{\mathcal{P}}$. Hence $M_1[i] \le M_2[i]$, and this is true for all
$0 \le i \le N - 1$. Conversely, the assumption that $M_1[i] \le M_2[i]$ for every $i$ yields $M_1 \preceq M_2$
by definition of $\preceq$. □

From Lemma 2 we see that for any $M \in \mathcal{M}(R_{\mathcal{P}})$, $M[i] \le M^*(R_{\mathcal{P}})[i]$ for all $0 \le i \le
N - 1$, and it follows that $M^*(R_{\mathcal{P}})$ corresponds to the consistent global checkpoint of
$\mathcal{P}$ in which each constituent local checkpoint is as advanced as possible. The antichain
$M^*(R_{\mathcal{P}})$ is therefore what we have referred to as the “recovery line” of $\mathcal{P}$ with the
minimum total rollback distance [48].
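
Theorem 1 and Lemma 2 together suggest a direct, if inefficient, way to find the recovery line of a small pattern: enumerate every global checkpoint, keep the consistent ones, and take the componentwise maximum. The Python sketch below does exactly that. Its function names and input encoding are assumptions (`latest[i]` is the latest checkpoint index of process i, and `deps` is a set of direct dependency edges between checkpoints `(i, x)`), and it is meant as a cross-check on tiny examples rather than as a practical algorithm.

```python
from itertools import product


def reachable_pairs(latest, deps):
    """Strict order on checkpoints: transitive closure of the checkpoint graph."""
    succ = {}
    for i, last in enumerate(latest):
        for x in range(last):
            succ.setdefault((i, x), set()).add((i, x + 1))
    for u, v in deps:
        succ.setdefault(u, set()).add(v)
    less = set()
    for start in list(succ):
        stack, seen = list(succ[start]), set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ.get(v, ()))
        less.update((start, v) for v in seen)
    return less


def consistent_global_checkpoints(latest, deps):
    """All index tuples (x_0, ..., x_{N-1}) whose checkpoints are pairwise unordered."""
    less = reachable_pairs(latest, deps)
    result = []
    for xs in product(*(range(last + 1) for last in latest)):
        cps = [(i, x) for i, x in enumerate(xs)]
        if all((a, b) not in less for a in cps for b in cps):
            result.append(xs)
    return result


def recovery_line_by_enumeration(latest, deps):
    """Componentwise maximum of the consistent global checkpoints (Lemma 2)."""
    candidates = consistent_global_checkpoints(latest, deps)
    top = tuple(max(xs[i] for xs in candidates) for i in range(len(latest)))
    # By the lattice property, the componentwise maximum is itself consistent.
    assert top in candidates
    return top
```

The enumeration visits every combination of local checkpoints, so its cost grows as the product of the per-process checkpoint counts; the rollback propagation algorithm of Fig. 1.5 reaches the same recovery line without this blow-up.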

Our development of the optimal garbage collection algorithm will be based on checkpoint
graphs rather than the more abstract posets. Given a checkpoint graph $G$, we
let $\mathcal{M}(G)$ denote the set of maximum-sized antichains of the poset $R_G$ corresponding to
the transitive closure of $G$. The maximal maximum-sized antichain $M^*(G)$ is similarly
defined. We prove in Appendix A that although the two posets $R_G$ and $R_{\mathcal{P}}$ are not
the same due to some missing dependencies in $R_G$ resulting from the direct dependency
tracking mechanism, $R_G$ possesses exactly the same set of maximum-sized antichains as
does $R_{\mathcal{P}}$ and therefore suffices for our purpose.

2. OPTIMAL GARBAGE COLLECTION


In this chapter, we describe the approach of recovery line transformation and decomposition
to achieving optimal garbage collection. The term “optimal” means we can
identify all of the checkpoints and message logs that are no longer useful for any future
recovery, and all of the retained checkpoints and message logs must be useful for
some possible future recovery. Section 2.1 derives the necessary and sufficient condition 
for identifying all garbage checkpoints, which then leads to an optimal checkpoint recla- 
mation algorithm [49]. Section 2.2 derives the lowest upper bound on the number of 
nongarbage checkpoints. For protocols requiring message logging to record the in-transit 
messages, Section 2.3 derives the necessary and sufficient condition for identifying all 
garbage message logs and develops an optimal message log reclamation algorithm which 
can be combined with the optimal checkpoint reclamation algorithm to minimize the 
space overhead for uncoordinated checkpointing. 

2.1 Optimal Checkpoint Reclamation 

2.1.1 Motivation and problem formulation 

Since a future program execution may contain arbitrary checkpoint dependencies and 
rollbacks, we first describe an execution model to make the problem tractable. An operational
session [5] is the interval between the start of normal execution and the instant
of rollback initiation, as shown in Fig. 2.1. A recovery session immediately follows the
previous operational session and ends at the resumption of normal execution. A program
execution can be viewed as consisting of a number of alternating operational sessions and
recovery sessions. In terms of the effect on the checkpoint graphs, new vertices are added
as new checkpoints are taken during an operational session, and existing vertices can be
deleted as some checkpoints are invalidated by the rollback during a recovery session.

Figure 2.1: Operational sessions, recovery sessions and nongarbage checkpoints.


Since the purpose of maintaining checkpoints is for possible future recovery, a checkpoint
is garbage if and only if it cannot belong to any future recovery line. Being obsolete,
i.e., before the global recovery line, is simply a sufficient condition for being garbage, but
not a necessary condition. We first give an example of nonobsolete garbage checkpoints.
Figure 2.2 is a typical example illustrating the domino effect. The global recovery line
stays at the set of initial checkpoints and is unable to move forward. The edge from
$c_{0,2}$ to $c_{1,2}$ and the one from $c_{1,1}$ to $c_{0,2}$ imply that $c_{0,2}$ is inconsistent with any checkpoint
of process $p_1$. Since a recovery line must contain one checkpoint from each process,
$c_{0,2}$ cannot belong to any future recovery line¹ and is therefore a garbage checkpoint.
Checkpoints $c_{1,1}$ and $c_{0,1}$ are garbage by similar arguments.



Figure 2.2: Example of nonobsolete garbage checkpoints. 

Figure 2.2 in fact provides another sufficient condition for identifying garbage
checkpoints; our optimal garbage collection aims at deriving the necessary and sufficient
condition. The difficulty of the problem lies in the fact that future process execution
may contain any number of operational sessions (with arbitrary checkpoint dependencies)
and recovery sessions (with arbitrary subsets of processes being faulty). We outline our
approach as follows. Instead of trying to find garbage checkpoints, we start by
identifying nongarbage checkpoints. Given any possible future recovery line which
contains some nongarbage checkpoints, for example, the recovery line shown in Fig. 2.1,
we perform recovery line transformation to transform it into another recovery line which
also contains those nongarbage checkpoints. Although there are an infinite number of
future recovery lines containing any nongarbage checkpoint, we prove that they can all be
transformed into a set of $2^N$ "immediate future" recovery lines. (Recall that $N$ is
the number of processes.) Our next step is recovery line decomposition. We identify a set
of $N$ recovery lines which forms the "basis" for those $2^N$ recovery lines and
therefore contains all of the nongarbage checkpoints.

$^{1}$ It is not hard to see that $c_{0,2}$ being a garbage checkpoint will not be
affected by the occurrence of any recovery session because every rollback either
preserves the "triangular" condition in Fig. 2.2 for $c_{0,2}$ or simply invalidates
$c_{0,2}$.

2.1.2 Recovery line transformation 

Our approach to transforming an arbitrary future recovery line backwards in time 
is to first define two elementary transformations: transformation within an operational 
session and transformation across a recovery session. Any transformation can then be 
achieved through a combination of these two elementary transformations.

Transformation within an operational session

During normal process execution, the size of the checkpoint graph increases as new
checkpoints are taken. Because checkpoint graphs represent program dependencies and are
not arbitrary directed acyclic graphs, the following rules must be satisfied when adding
new vertices. For every new vertex $c_{i,x}$ with $x \geq 1$,

Rule 1.a: $c_{i,x}$ must have an incoming edge from $c_{i,x-1}$;

Rule 1.b: $c_{i,x}$ cannot have any outgoing edge to any existing vertices because it
cannot happen before a checkpoint that was taken earlier.

We note that because of the unpredictable message transmission delay during the
dependency information collection process, the information associated with a checkpoint
$c_{j,y}$ that happened before $c_{i,x}$ is not necessarily collected by the garbage
collection initiator earlier than the information associated with $c_{i,x}$. However,
such a situation can be detected based on the dependency information. If a vertex
$c_{i,x}$ is supposed to have an incoming edge from a nonexisting vertex $c_{j,y}$, then
$c_{i,x}$ and all of its incoming edges will be temporarily excluded from the current
checkpoint graph. By adding each new vertex under this constraint, none of the new
vertices can have any edge pointing to any existing vertices and Rule 1.b is therefore
enforced. We use $\mathcal{G}_s(G)$ to denote the set of all potential supergraphs
obtainable by adjoining new vertices to a given checkpoint graph $G$ without violating
Rule 1.a and Rule 1.b.
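
The deferred insertion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `CheckpointGraph`, the encoding of a checkpoint $c_{i,x}$ as the pair `(i, x)`, and the `pending` table are hypothetical and do not come from the thesis.

```python
class CheckpointGraph:
    """Toy checkpoint graph; a checkpoint c_{i,x} is the pair (i, x) (assumed encoding)."""

    def __init__(self):
        self.preds = {}      # admitted vertex -> set of predecessors (incoming edges)
        self.pending = {}    # deferred vertex -> predecessors it is waiting for

    def add_vertex(self, v, preds=()):
        """Try to add v with incoming edges from preds; return True if v was admitted."""
        process, index = v
        required = set(preds)
        if index >= 1:
            required.add((process, index - 1))   # Rule 1.a: edge from c_{i,x-1}
        self.pending[v] = required
        self._flush()
        return v in self.preds

    def _flush(self):
        """Admit every pending vertex whose predecessors all exist by now."""
        progress = True
        while progress:
            progress = False
            for v, required in list(self.pending.items()):
                if all(p in self.preds for p in required):
                    # Edges only point from existing vertices to the new one,
                    # so Rule 1.b holds by construction.
                    self.preds[v] = required
                    del self.pending[v]
                    progress = True


g = CheckpointGraph()
g.add_vertex((0, 0))
g.add_vertex((1, 0))
deferred = g.add_vertex((1, 1), preds=[(0, 1)])   # (0, 1) is not known yet
admitted = g.add_vertex((0, 1))                   # also releases (1, 1)
```

Because a vertex is admitted only after all of its predecessors, no admitted vertex ever acquires an edge to an older vertex, which is the property the paragraph above relies on.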

Our transformation procedure generally involves changing part of the recovery line of a
graph $G_1$ to obtain the recovery line of another graph $G_2$. The following lemma will
be used throughout this chapter to ensure that the unchanged part, which forms an
antichain in $G_1$, remains an antichain in $G_2$ after the transformation.

LEMMA 3 Given a checkpoint graph $G = (V, E)$ and its potential supergraph
$G' = (V', E') \in \mathcal{G}_s(G)$, for any $A \subseteq V$, $A$ is an antichain in $G$
if and only if $A$ is an antichain in $G'$.

Proof. If $A$ is an antichain in $G$, then $u \not\prec v$ for any $u, v \in A$. Rule 1.b
guarantees that $u \not\prec v$ remains true in $G'$ because there cannot exist any
$w \in V' \setminus V$ such that $u \prec w \prec v$. Hence, $A$ is an antichain in $G'$.
Conversely, if $A$ is not an antichain in $G$, there must exist $u, v \in A$ such that
$u \prec v$, which clearly remains true in $G'$, and so $A$ is not an antichain in $G'$. □
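
As a small sanity check of Lemma 3, the following Python sketch tests the antichain property by reachability on a toy graph and on one of its potential supergraphs. The vertex names and the adjacency-dictionary representation are assumptions made for illustration only.

```python
def reachable(edges, start):
    """Return all vertices strictly reachable from start along directed edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_antichain(edges, subset):
    """True if no member of subset happens before another member."""
    return all(v not in reachable(edges, u)
               for u in subset for v in subset if u != v)


# G: two processes a and b, with a0 happening before b1.
g = {"a0": {"a1", "b1"}, "b0": {"b1"}}
# G': new vertices a2 and b2 adjoined; new edges only point to new vertices (Rule 1.b).
g_prime = {"a0": {"a1", "b1"}, "b0": {"b1"}, "a1": {"a2", "b2"}, "b1": {"b2"}}

same_for_antichain = is_antichain(g, {"a1", "b1"}) == is_antichain(g_prime, {"a1", "b1"})
same_for_chain = is_antichain(g, {"a0", "b1"}) == is_antichain(g_prime, {"a0", "b1"})
```

Both comparisons come out equal: adjoining vertices under Rule 1.b neither creates nor removes an ordering between two vertices of $G$.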

One special potential supergraph of $G$, denoted by $\hat{G}$, will play a major role
throughout this chapter. The graph $\hat{G}$ is constructed by adjoining a new vertex
$n_i$ at the end of $G$ for each $p_i$, with a single incoming edge from the last vertex
$l_i$, as shown in Fig. 2.3. Let $L$ denote the set of all last-nodes $l_i$ and $B$
denote the set of all new-nodes $n_i$. We will refer to the $2^N$ graphs $\hat{G} - W$,
$W \subseteq B$, as the immediate supergraphs of $G$. The proof of the following property
defines the recovery line transformation within an operational session: given the
recovery line of a potential supergraph $G'$ of $G$, by replacing its constituent
checkpoints which are not contained in $G$ with their corresponding new-nodes of
$\hat{G}$, we obtain the recovery line of an immediate supergraph of $G$.



Figure 2.3: Construction of the potential supergraph $\hat{G}$.
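
The construction of $\hat{G}$ and of its $2^N$ immediate supergraphs can be sketched as follows. The function names, the label `("n", p)` for the new-node of process `p`, and the adjacency-dictionary encoding are hypothetical choices for this illustration.

```python
from itertools import combinations


def build_g_hat(processes, edges):
    """Return (edges of G-hat, new-nodes): one new-node after each last-node."""
    hat = {u: set(vs) for u, vs in edges.items()}
    new_nodes = {}
    for p, checkpoints in processes.items():
        n = ("n", p)                                    # assumed label for n_p
        hat.setdefault(checkpoints[-1], set()).add(n)   # single edge l_p -> n_p
        new_nodes[p] = n
    return hat, new_nodes


def immediate_supergraphs(processes, edges):
    """Yield (W, edges of G-hat minus W) for every subset W of the new-nodes."""
    hat, new_nodes = build_g_hat(processes, edges)
    nodes = sorted(new_nodes.values())
    for r in range(len(nodes) + 1):
        for w in combinations(nodes, r):
            removed = set(w)
            yield removed, {u: vs - removed for u, vs in hat.items()}


processes = {0: ["a0", "a1"], 1: ["b0"], 2: ["c0"]}    # checkpoints in program order
edges = {"a0": {"a1", "b0"}}
graphs = list(immediate_supergraphs(processes, edges))  # 2**3 graphs for N = 3
```

Enumerating all subsets is exponential in $N$ and is shown only to make the definition concrete; the decomposition in Section 2.1.3 is what avoids this enumeration.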

PROPERTY 1 For any checkpoint $v$ in a checkpoint graph $G$, if $v$ belongs to the
recovery line of a potential supergraph $G'$, then $v$ must also belong to the recovery
line of an immediate supergraph of $G$. That is, given $G = (V, E)$, $v \in V$ and
$G' \in \mathcal{G}_s(G)$, if $v \in M^*(G')$ then $v \in M^*(\hat{G} - W)$ for some
$W \subseteq B$.

Proof. We partition $M^*(G')$ into $M_1 \cup M_2$ where

    $M_1 = M^*(G') \cap V$
    $M_2 = M^*(G') \setminus V$

as shown in Fig. 2.4. A corresponding partition of the new-nodes of $\hat{G}$ is given as
$B = B_1 \cup B_2$ such that

    $B_1 = \{n_i : M^*(G')[i] \in M_1\}$
    $B_2 = \{n_i : M^*(G')[i] \in M_2\}$.

Our goal is to show that

    $M^*(\hat{G} - B_1) = M_1 \cup B_2$.

Then, for any $v \in V$ and $v \in M^*(G')$, we must have
$v \in M_1 \subseteq M^*(\hat{G} - W)$ where $W = B_1 \subseteq B$.

First, we show that $M_1 \cup B_2 \in \mathcal{M}(\hat{G} - B_1)$. Define the subset $L_2$
of last-nodes corresponding to $M_2$ as $L_2 = \{l_j : M^*(G')[j] \in M_2\}$. Because
$M_1 \cup M_2$ forms an antichain in $G'$, we must have $M^*(G')[i] \not\prec l_j$ for any
$M^*(G')[i] \in M_1$ and $l_j \in L_2$ (otherwise $M^*(G')[i] \prec l_j \prec M^*(G')[j]$).
Now consider $\hat{G} - B_1$. We have $M^*(G')[i] \not\prec n_j$ for any $n_j \in B_2$
because each $n_j$ has only a single incoming edge from $l_j$. Clearly, any new-node
$n_j \not\prec M^*(G')[i]$. Lemma 3 further guarantees that $M_1$ ($\subseteq V$) remains
an antichain in $G$ and also in $\hat{G} - B_1$. Hence, we have
$M_1 \cup B_2 \in \mathcal{M}(\hat{G} - B_1)$.

We next prove that $M_1 \cup B_2 = M^*(\hat{G} - B_1)$ by contradiction. Suppose
$M_1 \cup B_2 \neq M^*(\hat{G} - B_1)$. There must exist
$M_1' = M^*(\hat{G} - B_1) \setminus B_2$ such that $M_1' \subseteq V$,
$M_1 \preceq M_1'$ and $M_1 \neq M_1'$, as shown in Fig. 2.4. Now consider $G'$. Recall
that $M_1$ and $M_2$ form an antichain in $G'$ and thus for any $u \in M_1'$ and
$M^*(G')[j] \in M_2$, we must have $u \not\prec M^*(G')[j]$. We also have
$M^*(G')[j] \not\prec u$ by Rule 1.b. Therefore, $M_1' \cup M_2$ forms an antichain in
$G'$, contradicting the fact that $M_1 \cup M_2$ is the maximal maximum-sized antichain of
$G'$. □

Figure 2.4: Recovery line transformation within an operational session.


The transformation within an operational session can be viewed as “projecting” any 
potential supergraph along the direction opposite to the time axis. It shows that although 
the number of potential supergraphs of G is infinite, the recovery lines of these graphs 
can intersect G in only a finite number of ways, and each of the possible intersections 
must be part of the recovery line of an immediate supergraph of G. 
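
The "projection" itself is a simple per-process substitution, sketched below. The function name, the dictionary encoding of a recovery line, and the `("n", p)` label for new-nodes are assumptions for illustration; the sketch performs only the substitution $M_1 \cup M_2 \mapsto M_1 \cup B_2$ and does not verify that its input is a recovery line.

```python
def project_recovery_line(line, vertices_of_g):
    """Map a recovery line of G' to M1 ∪ B2 and return it together with W = B1.

    line: dict process -> checkpoint on the recovery line of a potential supergraph G'.
    vertices_of_g: set of checkpoints that already exist in G.
    """
    projected, w = {}, set()
    for p, c in line.items():
        if c in vertices_of_g:
            projected[p] = c            # c is in M1 and is kept unchanged
            w.add(("n", p))             # its new-node is in B1 and is removed
        else:
            projected[p] = ("n", p)     # c is in M2 and is replaced by n_p
    return projected, w


V = {"c00", "c01", "c02", "c10", "c11", "c20", "c21"}
future_line = {0: "c02", 1: "x13", 2: "c21"}    # "x13" was taken after G was recorded
projected, w = project_recovery_line(future_line, V)
```

Here `projected` is the candidate recovery line of the immediate supergraph $\hat{G} - W$ with `w` as $W$; Property 1 is what guarantees that this candidate really is $M^*(\hat{G} - W)$.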


Transformation across a recovery session

Existing vertices on a checkpoint graph, for example, $c_{3,3}$ in Fig. 1.4(c), can be
deleted due to rollback recovery. Let $G_E$ denote the extended checkpoint graph as
defined in Section 1.3, $G = (V, E)$ denote the subgraph of $G_E$ without the virtual
checkpoints, and $G^- = (V^-, E^-)$ denote the checkpoint graph immediately after
recovery. Figures 2.5(a)-(c) illustrate these checkpoint graphs. Let $F$ denote the part
of $G$ deleted by the rollback; then we have $G^- = G - F$. By definition, $M^*(G_E)$ is
the local recovery line. Let $M^*(G_E) = M_r \cup M_v$ as shown in Fig. 2.5(b), where
$M_r = M^*(G_E) \cap V$ consists of real checkpoints and
$M_v = M^*(G_E) \setminus M_r$ consists of virtual checkpoints. According to the rollback
propagation algorithm, the following two rules must be satisfied when existing vertices
are deleted during recovery.

Rule 2.a: There cannot exist any $u \in M_r$ and $w \in V^-$ such that $u \prec w$, i.e.,
none of the checkpoints in $M_r$ can have any outgoing edge in $G^-$.

Rule 2.b: For any $u$ in $F$, all of the checkpoints reachable from $u$ must also be in
$F$. Consequently, none of the checkpoints in $F$ can have any outgoing edge to any
checkpoints in $G^-$.

We also define

    $T_1 = \{n_i : M^*(G_E)[i] \in M_r\}$ and $T_2 = \{n_j : M^*(G_E)[j] \in M_v\}$.    (2.1)

In other words, $T_1$ consists of the new-node for each process $p_i$ which contributes a
real checkpoint to the local recovery line. Parallel to the definitions of $l_i$, $n_i$,
$B$, $\hat{G}$, $T_1$ and $T_2$ for $G$, we define $l_i^-$, $n_i^-$, $\hat{G}^-$, $B^-$,
$T_1^-$ and $T_2^-$ for $G^-$. That is, $l_i^-$ denotes the last-node of $p_i$ in $G^-$,
$n_i^-$ denotes the new-node of $p_i$ in $\hat{G}^-$, $\hat{G}^-$ is obtained by adding
$n_i^-$ to $G^-$ for every $p_i$, $B^-$ denotes the set of all new-nodes in $\hat{G}^-$,
$T_1^- = \{n_i^- : M^*(G_E)[i] \in M_r\}$ and
$T_2^- = \{n_j^- : M^*(G_E)[j] \in M_v\}$. It is not hard to see that $T_2^- = T_2$.

We first prove the following lemma, which states the relationship between the
maximum-sized antichains of $G$ and those of its potential supergraphs.

LEMMA 4 Given a checkpoint graph $G = (V, E)$ and its potential supergraph
$G' = (V', E') \in \mathcal{G}_s(G)$, for any $M \subseteq V$,

(a) $M \in \mathcal{M}(G)$ if and only if $M \in \mathcal{M}(G')$;

(b) $M^*(G) \preceq M^*(G')$;

(c) if $M = M^*(G')$ then $M = M^*(G)$.

Proof. Rule 1.a guarantees that the largest size of any antichain remains the same in all
potential supergraphs. Hence, (a) follows immediately from Lemma 3. In particular,
$M^*(G) \in \mathcal{M}(G')$, which leads to (b). If $M \subseteq V$ and $M = M^*(G')$,
then $M \in \mathcal{M}(G)$ by (a) and so $M \preceq M^*(G)$; together with
$M^*(G) \preceq M$ from (b), this leads to $M = M^*(G)$. □

The proof of the following property defines the transformation across a recovery
session: given the recovery line $M$ of an immediate supergraph of $G^-$, for any $i$
such that $M[i]$ is a new-node and $M^*(G_E)[i]$ from the local recovery line is not a
virtual checkpoint, we replace $M[i]$ with $M^*(G_E)[i]$ to obtain the recovery line of
an immediate supergraph of $G$.

PROPERTY 2 For any checkpoint $v$ in $G^-$, if $v$ belongs to the recovery line of an
immediate supergraph of $G^-$, then $v$ must also belong to the recovery line of an
immediate supergraph of $G$. That is, given $G^- = (V^-, E^-)$ and $v \in V^-$, if
$v \in M^*(\hat{G}^- - W^-)$ for some $W^- \subseteq B^-$, then
$v \in M^*(\hat{G} - W)$ for some $W \subseteq B$.

Proof. Let $G_W = \hat{G}^- - W^-$. We partition the recovery line $M^*(G_W)$ into
$M_1 \cup M_2 \cup M_3$ where

    $M_1 = M^*(G_W) \cap V^-$
    $M_2 = \{n_i^- \in M^*(G_W) : M^*(G_E)[i] \in M_v\}$
    $M_3 = \{n_i^- \in M^*(G_W) : M^*(G_E)[i] \in M_r\}$                        (2.2)

as shown in Fig. 2.5(f). The two sets of new-nodes $B$ and $B^-$ are partitioned$^{2}$ as
follows.

    $B = B_1 \cup B_2$ where $B_1 = \{n_i : M^*(G_W)[i] \in M_1\}$              (2.3)
                             $B_2 = \{n_i : M^*(G_W)[i] \notin M_1\}$

    $B^- = B_1^- \cup B_2^-$ where $B_1^- = \{n_i^- : M^*(G_W)[i] \in M_1\}$    (2.4)
                                   $B_2^- = \{n_i^- : M^*(G_W)[i] \notin M_1\}$.

Our goal is to show that

    $M_1 \cup M_2 \cup M_4 = M^*(\hat{G} - (T_1 \cup B_1))$                     (2.5)

where

    $M_4 = \{M^*(G_E)[i] : n_i^- \in M_3\}$.

Then, for any $v \in V^-$ and $v \in M^*(G_W)$, we have
$v \in M_1 \subseteq M^*(\hat{G} - W)$ where $W = T_1 \cup B_1 \subseteq B$.

$^{2}$ $T_1 \cup T_2$ (Eq. (2.1)) is another partition of $B$ corresponding to
$M^*(G_E) = M_r \cup M_v$.

First, it is not hard to see that $W^- \subseteq B_1^-$ and so
$M^*(\hat{G}^- - B_1^-) = M^*(G_W)$ from Lemma 4(c) and the definitions of $B_1^-$ and
$M_1$. We now prove Eq. (2.5) by the following steps:
(a) $M_1 \cup M_2 \cup M_4 \in \mathcal{M}(\hat{G}^- - (T_1^- \cup B_1^-))$;
(b) $M_1 \cup M_2 \cup M_4 \in \mathcal{M}(\hat{G} - (T_1 \cup B_1))$;
(c) $M_1 \cup M_2 \cup M_4 = M^*(\hat{G} - (T_1 \cup B_1))$.

(a) That $M_1 \cup M_2 \cup M_3$ forms an antichain in $\hat{G}^- - B_1^-$ implies, for
any $u \in M_4$ and $w \in M_1 \cup M_2$, that $w \not\prec u$. This clearly remains true
in $\hat{G}^- - (T_1^- \cup B_1^-)$. Since Rule 2.a guarantees $u \not\prec w$, we have
$M_1 \cup M_2 \cup M_4 \in \mathcal{M}(\hat{G}^- - (T_1^- \cup B_1^-))$.

(b) By adding all of the vertices in $F$ to $\hat{G}^- - (T_1^- \cup B_1^-)$, we obtain
the graph $\hat{G} - (T_1 \cup B_1)$ as shown in Fig. 2.5(d). Rule 2.b guarantees that the
above process will not make any unordered pair in $\hat{G}^- - (T_1^- \cup B_1^-)$ become
ordered. Therefore, $M_1 \cup M_2 \cup M_4$ remains a maximum-sized antichain in
$\hat{G} - (T_1 \cup B_1)$.

(c) Suppose that $M_1 \cup M_2 \cup M_4 \neq M^*(\hat{G} - (T_1 \cup B_1))$. Then, we must
have

    $M_1 \cup M_2 \cup M_4 \prec M^*(\hat{G} - (T_1 \cup B_1))$.                (2.6)

By applying the transformation as defined in the proof of Property 1 to graphs $G$ and
$G_E$ (Fig. 2.5(b)), we have

    $M^*(\hat{G} - T_1) = M_r \cup T_2$,                                        (2.7)

as shown in Fig. 2.5(e). Since $\hat{G} - T_1$ is a potential supergraph of
$\hat{G} - (T_1 \cup B_1)$, we have, by Lemma 4(b),

    $M^*(\hat{G} - (T_1 \cup B_1)) \preceq M^*(\hat{G} - T_1)$.                 (2.8)

Equations (2.6), (2.7) and (2.8) and the fact that $M_2 \subseteq T_2$ and
$M_4 \subseteq M_r$ imply $M_2 \cup M_4 \subseteq M^*(\hat{G} - (T_1 \cup B_1))$, and
there exists $M_1'$ such that
$M_1' = M^*(\hat{G} - (T_1 \cup B_1)) \setminus (M_2 \cup M_4)$, $M_1 \preceq M_1'$ and
$M_1 \neq M_1'$ (as shown in Fig. 2.5(d)). Equations (2.7) and (2.8) further guarantee
that $M_1'$ does not intersect $F$ and so must exist in
$\hat{G}^- - (T_1^- \cup B_1^-)$ and hence $\hat{G}^- - B_1^-$. Following the same
argument as in the last part of the proof of Property 1, we can show that the existence
of $M_1'$ leads to a contradiction to the fact that
$M_1 \cup M_2 \cup M_3 = M^*(\hat{G}^- - B_1^-)$. □

Complete transformation 

We now apply Properties 1 and 2 to transforming an arbitrary future recovery line 
containing some nongarbage checkpoints. By repeatedly applying Property 1 within 
every operational session and Property 2 across every recovery session, we demonstrate 
that every such future recovery line of G can be transformed into the recovery line of an 
immediate supergraph of G which preserves all of those nongarbage checkpoints. 

PROPERTY 3 If a checkpoint in $G$ belongs to a future recovery line, then it must also
belong to the recovery line of an immediate supergraph of $G$. That is, given
$G = (V, E)$ and $v \in V$, if $v \in M^*(G')$ for a future checkpoint graph $G'$, then
$v \in M^*(\hat{G} - W)$ for some $W \subseteq B$.

Proof. Without loss of generality, we may assume $G$ is in the $q$th operational session
and $G'$ belongs to the $r$th session where $r > q$. Let $G_i$ denote the checkpoint graph
at the end of the $i$th operational session, $G_i^-$ denote the checkpoint graph at the
beginning of the same session, and $W_i$ denote a subset of new-nodes of $\hat{G}_i$.
Clearly, $v$ must belong to every such intermediate graph. By applying Property 1 to the
graph pairs $(G', G_r^-)$, $(\hat{G}_j - W_j, G_j^-)$ where $q + 1 \leq j \leq r - 1$ and
$(\hat{G}_q - W_q, G)$, and applying Property 2 to the graph pairs $(G_j^-, G_{j-1})$
where $q + 1 \leq j \leq r$, we can show that $v$ must always remain on the recovery line
of an immediate supergraph of one of the intermediate graphs throughout the
transformation procedure. Eventually, we have $v \in M^*(\hat{G} - W)$ for some
$W \subseteq B$. □

Figure 2.6 gives an example demonstrating the recovery line transformation.
Figure 2.6(a) is the current checkpoint graph $G$ considered for garbage collection.
Suppose that Fig. 2.6(b) is the extended checkpoint graph when a process initiates a
rollback; then Fig. 2.6(c) is the checkpoint graph immediately after the recovery.
Figure 2.6(d) shows another possible extended checkpoint graph when $p_0$ initiates a
second rollback. Since checkpoints $A$ and $B$ are needed for recovery in this case, they
should be considered nongarbage checkpoints of $G$. We first apply Property 1 to the
graph pair $(G_d, G_c)$ and transform the recovery line of $G_d$ into the recovery line
of $G_g$ (an immediate supergraph of $G_c$) by replacing $X$, $Y$ and $Z$ with their
corresponding new-nodes of $\hat{G}_c$, namely, $P$, $Q$ and $R$, respectively. Then we
apply Property 2 to the pair $(G_c, G_b)$. Since $p_3$ and $p_4$ contribute real
checkpoints $C$ and $D$, respectively, to the local recovery line in Fig. 2.6(b), the
recovery line of $G_g$ is transformed into the recovery line of $G_f$ (an immediate
supergraph of $G_b$) by replacing $Q$ and $R$ with $C$ and $D$. Finally, by applying
Property 1 to the pair $(G_f, G)$, we obtain the recovery line of $G_e$ (an immediate
supergraph of $G$) which still contains the nongarbage checkpoints $A$ and $B$.

2.1.3 Recovery line decomposition 

Property 3 states that the recovery lines of the $2^N$ immediate supergraphs of $G$
contain all nongarbage checkpoints. We next show that there exists a set of $N$ recovery
lines which forms a "basis" for the $2^N$ recovery lines: each of the $2^N$ recovery
lines is the set of minimal elements of the union of a subset of the $N$ basis recovery
lines. Therefore, it suffices to find these $N$ recovery lines to identify all nongarbage
checkpoints.

Let $X \wedge Y$ denote the meet (greatest lower bound) of $X$ and $Y$ in a lattice and
$\min(S)$ denote the set of minimal elements in $S$. We first show that the greatest
lower bound of any $k$ maximum-sized antichains can be obtained as the set of minimal
elements in their union.

LEMMA 5 Given a poset $P$, $M \in \mathcal{M}(P)$ and $M \preceq M_i \in \mathcal{M}(P)$
for $0 \leq i \leq k - 1$ for any finite $k$, define
$\bigwedge_{0 \leq i \leq k-1} M_i = (\cdots((M_0 \wedge M_1) \wedge M_2) \cdots) \wedge M_{k-1}$;
then

(a) $M \preceq \bigwedge_{0 \leq i \leq k-1} M_i \in \mathcal{M}(P)$ and
(b) $\bigwedge_{0 \leq i \leq k-1} M_i = \min(\bigcup_{0 \leq i \leq k-1} M_i)$.

Proof. Both parts will be proved by induction on $k$ and based on the following theorem
from Anderson's book [46]: for any poset $Q$ and $M_1, M_2 \in \mathcal{M}(Q)$, the meet
(greatest lower bound) of $M_1$ and $M_2$ can be expressed as

    $M_1 \wedge M_2 = \min(M_1 \cup M_2)$.                                      (2.9)

(a) $\mathcal{M}(P)$ is a lattice and therefore $M_0 \wedge M_1 \in \mathcal{M}(P)$.
Also, $M \preceq M_0 \wedge M_1$ because $M \preceq M_0$, $M \preceq M_1$ and
$M_0 \wedge M_1$ is the greatest lower bound of $M_0$ and $M_1$. We have shown the case
$k = 2$ is true. Assume that it is true for $k = n - 1$, i.e.,

    $M \preceq \bigwedge_{0 \leq i \leq n-2} M_i \in \mathcal{M}(P)$.           (2.10)

Again, the lattice property of $\mathcal{M}(P)$ ensures that

    $\bigwedge_{0 \leq i \leq n-1} M_i = (\bigwedge_{0 \leq i \leq n-2} M_i) \wedge M_{n-1} \in \mathcal{M}(P)$.

Equation (2.10) and $M \preceq M_{n-1}$ imply that

    $M \preceq \bigwedge_{0 \leq i \leq n-1} M_i$.

Therefore, it is also true for $k = n$ and hence we have (a).

(b) The case $k = 2$ follows directly from Eq. (2.9). Assume that it is true for
$k = n - 1$, i.e.,

    $\bigwedge_{0 \leq i \leq n-2} M_i = \min(\bigcup_{0 \leq i \leq n-2} M_i)$.  (2.11)

Applying part (a), Eqs. (2.9) and (2.11), we have

    $\bigwedge_{0 \leq i \leq n-1} M_i = (\bigwedge_{0 \leq i \leq n-2} M_i) \wedge M_{n-1}
        = \min(\min(\bigcup_{0 \leq i \leq n-2} M_i) \cup M_{n-1})$.

Lemma 6 (to be proved next) further gives that

    $\min(\min(\bigcup_{0 \leq i \leq n-2} M_i) \cup M_{n-1})
        = \min(\bigcup_{0 \leq i \leq n-2} M_i \cup M_{n-1})
        = \min(\bigcup_{0 \leq i \leq n-1} M_i)$.

Therefore, by induction, part (b) is true. □

LEMMA 6 Given a poset $P = (S, \leq)$ and $X, Y \subseteq S$,
$\min(X \cup Y) = \min(\min(X) \cup Y)$.

Proof. First, we prove $\min(X \cup Y) \subseteq \min(\min(X) \cup Y)$. For every
$z \in X \cup Y$ with $z \notin \min(\min(X) \cup Y)$, there exists a $z'$ in
$\min(X) \cup Y$ such that $z' < z$. Since both $z$ and $z'$ are in $X \cup Y$, it
follows that $z \notin \min(X \cup Y)$.

Conversely, we prove $\min(\min(X) \cup Y) \subseteq \min(X \cup Y)$. For every
$z \notin \min(X \cup Y)$, there exists a $z'$ in $X \cup Y$ such that $z' < z$. If
$z' \in \min(X) \cup Y$, then we immediately obtain $z \notin \min(\min(X) \cup Y)$.
Otherwise, if $z' \in X \setminus \min(X)$, then there exists a $z'' \in \min(X)$ such
that $z'' < z'$, hence $z'' < z$, and again we have $z \notin \min(\min(X) \cup Y)$. □
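
Lemma 6 can be spot-checked numerically. The sketch below uses the divisibility order on a few positive integers as a stand-in poset; the helper names and the choice of poset are assumptions for illustration and are unrelated to checkpoint graphs.

```python
def minimal(elements, less):
    """Set of minimal elements of `elements` under the strict order `less`."""
    return {z for z in elements if not any(less(w, z) for w in elements)}


def divides(a, b):
    """Strict divisibility order on positive integers (illustrative poset)."""
    return a != b and b % a == 0


X = {4, 6, 9, 12}
Y = {3, 8, 10}

lhs = minimal(X | Y, divides)                     # min(X ∪ Y)
rhs = minimal(minimal(X, divides) | Y, divides)   # min(min(X) ∪ Y)
```

Both sides evaluate to `{3, 4, 10}`: taking the minimal elements of $X$ first discards only elements that could not be minimal in the union anyway.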

PROPERTY 4 For every $W \subseteq B$ and $W \neq \emptyset$,

    $M^*(\hat{G} - W) = \min(\bigcup_{n_i \in W} M^*(\hat{G} - n_i))$.          (2.12)

Proof. Without loss of generality, let $W = \{n_0, n_1, \ldots, n_{k-1}\}$ where
$1 \leq k \leq N$. Since $\hat{G} - n_j \in \mathcal{G}_s(\hat{G} - W)$,
$M^*(\hat{G} - W) \preceq M^*(\hat{G} - n_j)$ for all $0 \leq j \leq k - 1$ by
Lemma 4(b).

Now consider the graph $\hat{G}$. From Lemma 4(a), we have
$M^*(\hat{G} - W) \in \mathcal{M}(\hat{G})$ and
$M^*(\hat{G} - n_j) \in \mathcal{M}(\hat{G})$ for all $0 \leq j \leq k - 1$. Let
$M' = \min(\bigcup_{0 \leq j \leq k-1} M^*(\hat{G} - n_j))$. From Lemma 5, we have

    $M^*(\hat{G} - W) \preceq \bigwedge_{0 \leq j \leq k-1} M^*(\hat{G} - n_j) = M' \in \mathcal{M}(\hat{G})$.   (2.13)

Since $M^*(\hat{G} - n_j)[j] \prec n_j$ and thus $n_j \notin M'$ for all
$0 \leq j \leq k - 1$, every $x \in M'$ must be contained in $\hat{G} - W$. From
Lemma 4(a), we have $M' \in \mathcal{M}(\hat{G} - W)$ and hence

    $M' \preceq M^*(\hat{G} - W)$.                                              (2.14)

Combining Eqs. (2.13) and (2.14), we have proved that

    $M^*(\hat{G} - W) = M' = \min(\bigcup_{n_i \in W} M^*(\hat{G} - n_i))$. □

In particular, the global recovery line $M^*(G)$ can be obtained by letting $W = B$, that
is,

    $M^*(G) = \min(\bigcup_{0 \leq i \leq N-1} M^*(\hat{G} - n_i))$.

As an example, we demonstrate the decomposition of $M^*(G_e)$ in Fig. 2.6(e) where
$G_e = \hat{G} - \{n_0, n_1, n_3, n_4\}$. From Property 4 and referring to Fig. 2.7, we
have

    $M^*(G_e) = \min(M^*(\hat{G} - n_0) \cup M^*(\hat{G} - n_1) \cup M^*(\hat{G} - n_3) \cup M^*(\hat{G} - n_4))$
              $= \min(\{A, B, n_2, n_3, n_4, n_0, I, n_1, J, C, D\}) = \{A, B, n_2, C, D\}$

which is exactly the recovery line shown in Fig. 2.6(e).

2.1.4 Predictive checkpoint space reclamation algorithm 

We are now prepared to derive a necessary and sufficient condition for identifying all 
nongarbage checkpoints. 

THEOREM 2 A checkpoint $v$ in a checkpoint graph $G$ is nongarbage if and only if
$v \in M^*(\hat{G} - n_i)$ for some $0 \leq i \leq N - 1$.

Proof. If $v \in M^*(\hat{G} - n_i)$ for some $0 \leq i \leq N - 1$, then $v$ is
nongarbage because $\hat{G} - n_i$ is a possible future checkpoint graph. Conversely, if
$v$ is nongarbage, we have by definition $v \in M^*(G')$ for some future checkpoint graph
$G'$. From Property 3, $v \in M^*(\hat{G} - W)$ for some $W \subseteq B$; from
Property 4,

    $v \in \min(\bigcup_{n_i \in W} M^*(\hat{G} - n_i))
        \subseteq \bigcup_{n_i \in W} M^*(\hat{G} - n_i)
        \subseteq \bigcup_{0 \leq i \leq N-1} M^*(\hat{G} - n_i)$.

Therefore, $v \in M^*(\hat{G} - n_i)$ for some $0 \leq i \leq N - 1$. □

Figure 2.7: Example of the PCSR algorithm. Shaded checkpoints in (a)-(e) belong to the
recovery lines and the nonshaded checkpoints in (f) are garbage.


Based on Theorem 2, we now present in Fig. 2.8 the Predictive Checkpoint Space
Reclamation (PCSR) algorithm for finding the $N$ recovery lines. Since the rollback
propagation algorithm in Fig. 1.5 is of time complexity $O(|E|)$, where $|E|$ is the
total number of edges in the checkpoint graph (as every edge visited can be deleted), the
PCSR algorithm is of time complexity $O(N|E|)$.


/* N_g(G) denotes the set of nongarbage checkpoints of G */
/* N is the number of processes */
/* G-hat and n_i are as defined in Fig. 2.3 */

for each 0 <= i <= N - 1 {
    apply the rollback propagation algorithm in Fig. 1.5 to the checkpoint
    graph G-hat - n_i to find the recovery line;

    all checkpoints in the recovery line except for the new-nodes are included
    in the set N_g(G);
}

all of the checkpoints not in N_g(G) can be garbage-collected.


Figure 2.8: The Predictive Checkpoint Space Reclamation algorithm. 


An example illustrating the execution of the PCSR algorithm on the checkpoint graph 
G in Fig. 2.3 is shown in Fig. 2.7. All of the checkpoints in G are nonobsolete and 
must be retained according to the traditional algorithm. Our PCSR algorithm, however, 
determines that all of the nonshaded checkpoints in Fig. 2.7(f) can be discarded. 
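
The loop of Fig. 2.8 can be sketched in Python as follows. Since Fig. 1.5 is not reproduced in this section, `recovery_line` below is a simple reachability-based stand-in for the rollback propagation algorithm (it recomputes reachability on every pass and therefore does not attain the $O(|E|)$ bound). All names, the `("n", p)` label for new-nodes and the example graph are assumptions; the sketch also assumes that initial checkpoints have no incoming edges. The example reproduces the two-process domino pattern discussed for Fig. 2.2.

```python
def reachable_from(succ, roots):
    """All vertices strictly reachable from any vertex in roots."""
    seen, stack = set(), list(roots)
    while stack:
        for nxt in succ.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def recovery_line(chains, succ):
    """Latest antichain with one vertex per process (stand-in for Fig. 1.5)."""
    line = {p: len(c) - 1 for p, c in chains.items()}     # start at the last vertex
    while True:
        marked = reachable_from(succ, [chains[p][k] for p, k in line.items()])
        stale = [p for p, k in line.items() if chains[p][k] in marked]
        if not stale:
            return {chains[p][k] for p, k in line.items()}
        for p in stale:
            while chains[p][line[p]] in marked:           # roll back to last unmarked
                line[p] -= 1


def pcsr(chains, succ):
    """Union of the real checkpoints on the recovery lines of G-hat minus n_i."""
    real = {v for c in chains.values() for v in c}
    nongarbage = set()
    for i in chains:
        # Build G-hat minus n_i: every process except i gets its new-node.
        hat_chains = {p: c + ([] if p == i else [("n", p)]) for p, c in chains.items()}
        hat_succ = {u: set(vs) for u, vs in succ.items()}
        for p, c in chains.items():
            if p != i:
                hat_succ.setdefault(c[-1], set()).add(("n", p))
        nongarbage |= recovery_line(hat_chains, hat_succ) & real
    return nongarbage


# Two processes with a zigzag (domino) dependency pattern.
chains = {0: ["c00", "c01", "c02"], 1: ["c10", "c11", "c12"]}
succ = {
    "c00": {"c01"}, "c01": {"c02", "c11"}, "c02": {"c12"},
    "c10": {"c11", "c01"}, "c11": {"c12", "c02"},
}
nongarbage = pcsr(chains, succ)
garbage = {v for c in chains.values() for v in c} - nongarbage
```

For this graph the sketch keeps `c00`, `c10` and `c12` and reports `c01`, `c02` and `c11` as garbage, even though none of the six checkpoints is obsolete with respect to the global recovery line `{c00, c10}`.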
2.1.5 Experimental results

Four parallel programs are used to illustrate the checkpoint space reclamation
capabilities and benefits of the PCSR algorithm. Two of them are CAD programs written for
the Intel iPSC/2 hypercube: Cell Placement and Channel Router; the other two are Knight
Tour and N-Queen, written in the Chare Kernel language, which has been developed as a
medium-grained machine-independent parallel language [50]. We use the Encore Multimax 510
multiprocessor version of the Chare Kernel. Communication traces are collected for these
four programs, and trace-driven simulation is performed to obtain the results. The
checkpoint interval for each program is arbitrarily chosen to be approximately ten
percent of the total execution time, as shown in Table 2.1.

Table 2.1: Execution and checkpoint parameters of the programs. 


Benchmark programs          Cell Placement           Channel Router           Knight Tour        N-Queen
Number of processors        8                        8                        6                  6
Machine                     Intel iPSC/2 hypercube   Intel iPSC/2 hypercube   Encore Multimax    Encore Multimax
Execution time (sec)        322.7                    469.3                    273.2              1625.1
Checkpoint interval (sec)   35                       40                       30                 150


Figures 2.9-2.12 compare our PCSR algorithm with the traditional algorithm for typical
executions of the four programs. Each curve shows the number of checkpoints which would
be retained if the algorithm is invoked after a certain number of checkpoints have been
taken. The domino effect is illustrated by the linear increase in the number of
nonobsolete checkpoints as the total number of checkpoints increases. The largest
difference between the number of nonobsolete checkpoints and the number of nongarbage
checkpoints for each program is 39 versus 7 for Cell Placement, 48 versus 12 for Channel
Router, 24 versus 10 for Knight Tour and 41 versus 5 for N-Queen.

Figure 2.9: Nonobsolete and nongarbage checkpoints for Cell Placement. (Plot of the
number of retained checkpoints against the number of checkpoints taken.)

Figure 2.10: Nonobsolete and nongarbage checkpoints for Channel Router. (Plot of the
number of retained checkpoints against the number of checkpoints taken.)

Figure 2.11: Nonobsolete and nongarbage checkpoints for Knight Tour.

Figure 2.12: Nonobsolete and nongarbage checkpoints for N-Queen.

2.2 Upper Bound on the Number of Nongarbage Checkpoints 

As mentioned in Chapter 1, the traditional approach to garbage collection by discarding
only obsolete checkpoints has led to the common perception that the space overhead for
uncoordinated checkpointing is potentially unbounded. Theorem 2 not only identifies the
minimum set of nongarbage checkpoints but also places an upper bound $N^2$ on the number
of nongarbage checkpoints because each $M^*(\hat{G} - n_i)$, $0 \leq i \leq N - 1$,
consists of $N$ checkpoints. The following property identifies the inherent relations
among the $M^*(\hat{G} - n_i)$'s, and is the key to further improving the $N^2$ upper
bound to the lowest upper bound $N(N + 1)/2$.

PROPERTY 5 For any $0 \leq i, j \leq N - 1$ and $i \neq j$, if
$M^*(\hat{G} - n_i)[j] \neq n_j$ and $M^*(\hat{G} - n_j)[i] \neq n_i$, then
$M^*(\hat{G} - n_i) = M^*(\hat{G} - n_j)$.

Proof. From Lemma 4(a), $M^*(\hat{G} - n_i)[j] \neq n_j$ implies
$M^*(\hat{G} - n_i) \in \mathcal{M}(\hat{G} - n_i - n_j)$. Again from Lemma 4(a),
$M^*(\hat{G} - n_i) \in \mathcal{M}(\hat{G} - n_j)$. We then have
$M^*(\hat{G} - n_i) \preceq M^*(\hat{G} - n_j)$. Similarly,
$M^*(\hat{G} - n_j)[i] \neq n_i$ leads to
$M^*(\hat{G} - n_j) \preceq M^*(\hat{G} - n_i)$. Therefore,
$M^*(\hat{G} - n_i) = M^*(\hat{G} - n_j)$. □

We note that the efficiency of the PCSR algorithm can be further improved by applying
Property 5. Suppose that, inside the loop in Fig. 2.8, we have found the recovery line
$M^*(\hat{G} - n_i)$ for all $0 \leq i \leq K$. Define the index set $\Gamma[j]$ for any
$j > K$ as

    $\Gamma[j] = \{i : M^*(\hat{G} - n_i)[j] \neq n_j, 0 \leq i \leq K\}$.

Then, for each later loop index $j$, the rollback propagation algorithm can be aborted
when any checkpoint of process $p_i$, $i \in \Gamma[j]$, is marked, because that would
mean $M^*(\hat{G} - n_j)[i] \neq n_i$, and $M^*(\hat{G} - n_j)$ is then exactly the same
as $M^*(\hat{G} - n_i)$ by Property 5. We are now prepared to prove the second major
result.


THEOREM 3 Let $N_g(G)$ denote the set of nongarbage checkpoints of $G$ and $N$ be the
number of processes. Then,

    $|N_g(G)| \leq \frac{N(N + 1)}{2}$.

Proof. By Theorem 2, we have to consider only the $N^2$ vertices
$M^*(\hat{G} - n_i)[j]$, $0 \leq i, j \leq N - 1$. First, $M^*(\hat{G} - n_i)[i]$ for all
$0 \leq i \leq N - 1$ must be in $G$ and must contribute $N$ vertices to $N_g(G)$. For the
remaining $N^2 - N$ vertices with $i \neq j$, we consider the pair
$M^*(\hat{G} - n_i)[j]$ and $M^*(\hat{G} - n_j)[i]$ one at a time and there are
$(N^2 - N)/2$ such pairs. We distinguish three cases:

Case 1: $M^*(\hat{G} - n_i)[j] = n_j$ and $M^*(\hat{G} - n_j)[i] = n_i$. Neither new-node
belongs to $N_g(G)$.

Case 2: $M^*(\hat{G} - n_i)[j] = n_j$ and $M^*(\hat{G} - n_j)[i] \neq n_i$, or
$M^*(\hat{G} - n_i)[j] \neq n_j$ and $M^*(\hat{G} - n_j)[i] = n_i$. This pair will
possibly add one new checkpoint to $N_g(G)$.

Case 3: $M^*(\hat{G} - n_i)[j] \neq n_j$ and $M^*(\hat{G} - n_j)[i] \neq n_i$. It follows
that $M^*(\hat{G} - n_i) = M^*(\hat{G} - n_j)$ from Property 5, and thus
$M^*(\hat{G} - n_i)[j] = M^*(\hat{G} - n_j)[j]$ and
$M^*(\hat{G} - n_j)[i] = M^*(\hat{G} - n_i)[i]$. Since $M^*(\hat{G} - n_i)[i]$ and
$M^*(\hat{G} - n_j)[j]$ are already in $N_g(G)$, this case does not increase the size of
$N_g(G)$.

Therefore, each of the $(N^2 - N)/2$ pairs can contribute at most one new checkpoint to
$N_g(G)$ and hence

    $|N_g(G)| \leq N + \frac{N^2 - N}{2} = \frac{N(N + 1)}{2}$. □


We next show that $N(N + 1)/2$ is in fact the lowest upper bound because for any $N$ we
can construct a checkpoint graph $G_N^*$ as shown in Fig. 2.13 to achieve this upper
bound. Figure 2.13 shows the nongarbage checkpoints contributed by each of the $N$
recovery lines in the PCSR algorithm. All of the $N(N + 1)/2$ checkpoints are identified
as nongarbage checkpoints.

As a final note, the greatest lower bound of $N$ is achieved when none of the
$(N^2 - N)/2$ pairs contributes any nongarbage checkpoint. Coordinated checkpointing
protocols [19] guarantee that, immediately after a checkpointing session, the last-node
of every process must be a maximal element; as a result, Case 1 holds for all pairs,
thereby achieving the greatest lower bound.

Figure 2.13: $G_N^*$: The checkpoint graph with $N(N + 1)/2$ nongarbage checkpoints.

2.3 Optimal Message Log Reclamation

As described in Section 1.2, some protocols require message logging to record the
in-transit messages. Message logs thus constitute another source of space overhead in
addition to checkpoints. It should be noted that message logging can have two purposes.
If the message logs are used for state reconstruction in the piecewise deterministic
model as described in the next section, then both the message contents and the ordinal
positions [30] (or state interval indices [33]), i.e., order of processing, are required.
If the message logs are used for recording in-transit messages as is considered in this
section, then only message contents are important because such messages are allowed to
arbitrarily interleave with messages from other incoming channels. For our purpose, a
message log is nongarbage if and only if it can become an in-transit message with respect
to a possible future recovery line. Since our development of the algorithm will be based
upon the checkpoint graphs, we first define a nongarbage edge as follows. Let
$(c_{j,y}, c_{i,x})$ denote the edge representing the relation $c_{j,y} \prec c_{i,x}$.
Given any consistent global checkpoint $M$, we say $(c_{j,y}, c_{i,x})$ "intersects $M$"
if $c_{j,y} \prec M[j]$ and $M[i] \prec c_{i,x}$. By the definition of in-transit
messages, we say an edge is nongarbage if and only if it can intersect a future recovery
line. We will demonstrate that the recovery line transformation and decomposition defined
in the previous section can also be used to derive the necessary and sufficient condition
for identifying all nongarbage edges. More precisely, we prove that an edge is nongarbage
if and only if it intersects $M^*(\hat{G} - n_i)$ for some $0 \leq i \leq N - 1$. All the
message logs corresponding to the garbage edges can then be garbage-collected.
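
The intersection test can be written down directly once checkpoints of the same process are compared by index. The sketch below assumes that a checkpoint $c_{i,x}$ is encoded as the pair `(i, x)` and that a consistent global checkpoint is given as a mapping from process to checkpoint index; both encodings and the function name are illustrative assumptions.

```python
def intersects(edge, line):
    """True if the edge ((j, y), (i, x)) intersects the global checkpoint `line`.

    line maps each process to the index of its checkpoint in M, so that
    c_{j,y} precedes M[j] exactly when y < line[j].
    """
    (j, y), (i, x) = edge
    return y < line[j] and line[i] < x


M = {0: 2, 1: 1}                              # M = {c_{0,2}, c_{1,1}}
crossing = intersects(((0, 1), (1, 2)), M)    # source before M[0], target after M[1]
not_crossing = intersects(((0, 2), (1, 2)), M)  # c_{0,2} is M[0] itself, not before it
```

Only the first edge would correspond to a message that is in transit with respect to `M`, so only its log would need to be retained for this recovery line.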

2.3.1 Recovery line transformation 

Given any nongarbage edge (c, o') in G which intersects a future recovery line, we 
will show that ( c,c ') must also intersect M*{G — W) for some W C B, after repeatedly 
and alternately applying the transformations within an operational session and across a 
recovery session. 

PROPERTY 6 For any edge (c Jiy , c,>) in a checkpoint graph G, if (c Jry , Cj iX ) intersects 
the recover line of a potential supergraph G' , then (c JiV , c,-^) must also intersect the 
recovery line of an immediate supergraph of G. 

Proof. Suppose $(c_{j,y}, c_{i,x})$ intersects $M^*(G') = M_1 \cup M_2$, where $G = (V, E)$, $G'$ is a
potential supergraph of $G$, $M_1 = M^*(G') \cap V$ and $M_2 = M^*(G') \setminus V$, as shown in Fig. 2.14.
We want to show that $(c_{j,y}, c_{i,x})$ intersects $M^*(G - B_1) = M_1 \cup B_2$, where
$B_1 = \{n_i : M^*(G')[i] \in M_1\}$ and $B_2 = \{n_i : M^*(G')[i] \in M_2\}$, which is the recovery line
obtained by applying the transformation within an operational session to $M^*(G')$.



Figure 2.14: Nongarbage edge in the transformation within an operational session. 

By definition, $c_{j,y} < M^*(G')[j]$ and $M^*(G')[i] < c_{i,x}$. Since $c_{i,x} \in V$, we must have
$M^*(G')[i] \in M_1$ and hence $M^*(G - B_1)[i] = M^*(G')[i] < c_{i,x}$. Similarly, if $M^*(G')[j] \in M_1$,
then $c_{j,y} < M^*(G - B_1)[j]$; otherwise, $M^*(G')[j] \in M_2$ and we must have
$c_{j,y} \le l_j < n_j = M^*(G - B_1)[j]$. Therefore, we have shown $(c_{j,y}, c_{i,x})$ intersects $M^*(G - B_1)$ as
required. □

PROPERTY 7 Let $G$ and $G^-$ denote the checkpoint graphs immediately before and
after a recovery session, respectively. For any edge $(c_{j,y}, c_{i,x})$ in $G^-$, if $(c_{j,y}, c_{i,x})$ intersects
the recovery line of an immediate supergraph of $G^-$, then $(c_{j,y}, c_{i,x})$ must also intersect
the recovery line of an immediate supergraph of $G$.

Proof. Suppose $(c_{j,y}, c_{i,x})$ intersects $M^*(G^- - B_1^-) = M_1 \cup M_2 \cup M_3$ as shown in
Fig. 2.15, where the $M_i$'s were defined in Eq. (2.2) and $B_1^-$ was defined in Eq. (2.4). We
want to show that $(c_{j,y}, c_{i,x})$ intersects $M^*(G - (T_1 \cup B_1)) = M_1 \cup M_2 \cup M_4$, where $T_1$ was
defined in Eq. (2.1) and $B_1$ was defined in Eq. (2.3), which is the recovery line obtained
by applying the transformation across a recovery session to $M^*(G^- - B_1^-)$.





Figure 2.15: Nongarbage edge in the transformation across a recovery session. 

Let $M_0^* = M^*(G - (T_1 \cup B_1))$ and $M^* = M^*(G^- - B_1^-)$. By definition, $c_{j,y} < M^*[j]$
and $M^*[i] < c_{i,x}$. Following the same arguments as in the proof of Property 6, we have
$M_0^*[i] = M^*[i] < c_{i,x}$ and, if $M^*[j] \in M_1 \cup M_2$, $c_{j,y} < M_0^*[j]$. If $M^*[j] \in M_3$, we still have
$c_{j,y} < M_0^*[j]$ unless $c_{j,y} \in M_4$. Since Rule 2.a guarantees that none of the vertices in $M_4$
can have any outgoing edge in $G$, $c_{j,y} \notin M_4$ and therefore we have shown that $(c_{j,y}, c_{i,x})$
intersects $M_0^*$ as required. □

Combining Properties 6 and 7 and following the proof of Property 3 immediately lead 
to the following result. 

PROPERTY 8 If an edge $(c_{j,y}, c_{i,x})$ in a checkpoint graph $G$ intersects a future recovery
line of $G$, then $(c_{j,y}, c_{i,x})$ must also intersect the recovery line of an immediate supergraph
of $G$.

2.3.2 Recovery line decomposition 

We will first show that Eq. (2.12) in Property 4 can be expressed in terms of a
component-wise minimum operation, and then prove that any edge intersecting the
recovery line of an immediate supergraph must intersect one of the $N$ recovery lines
$M^*(G - n_i)$, $0 \le i \le N - 1$.

LEMMA 7 Given a checkpoint graph $G$, $W \subseteq B$ and $W \ne \emptyset$,

$$M^*(G - W)[j] = \min\Big(\bigcup_{n_k \in W} M^*(G - n_k)[j]\Big) \quad \text{for all } 0 \le j \le N - 1.$$

Proof. For any $0 \le j \le N - 1$, we consider the set of checkpoints $\{M^*(G - n_k)[j] : n_k \in W\}$,
which contains all of the checkpoints of $p_j$ in $\bigcup_{n_k \in W} M^*(G - n_k)$. Only
$\min(\bigcup_{n_k \in W} M^*(G - n_k)[j])$ can be a minimal element. From Property 4,
$M^*(G - W) = \min(\bigcup_{n_k \in W} M^*(G - n_k))$. Since $M^*(G - W)$ must contain one checkpoint from each
process, we have $M^*(G - W)[j] = \min(\bigcup_{n_k \in W} M^*(G - n_k)[j])$ as required. □
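
Lemma 7 can be read as a component-wise minimum over per-process checkpoint indices. The following Python sketch assumes each recovery line $M^*(G - n_k)$ is encoded as a list whose $j$th entry is the index of the checkpoint of $p_j$ on that line; this encoding, the function name, and the sample data are illustrative assumptions, not notation or results from the text.

```python
from typing import Dict, Iterable, List


def recovery_line_for_subset(lines: Dict[int, List[int]], W: Iterable[int]) -> List[int]:
    """Component-wise minimum of the recovery lines M*(G - n_k) for n_k in W.

    lines[k][j] is the (assumed) checkpoint index of process j on M*(G - n_k).
    """
    chosen = [lines[k] for k in W]
    if not chosen:
        raise ValueError("W must be nonempty")
    # zip(*chosen) groups the entries of process j from every chosen line.
    return [min(column) for column in zip(*chosen)]


# Hypothetical recovery lines for N = 3 processes.
example = {0: [2, 3, 1], 1: [3, 1, 2], 2: [3, 3, 0]}
print(recovery_line_for_subset(example, [0, 1]))
```

Under this encoding, the recovery line of any immediate supergraph $G - W$ is obtained from the $N$ precomputed lines without revisiting the graph.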

PROPERTY 9 If an edge $(c_{j,y}, c_{i,x})$ in a checkpoint graph $G$ intersects the recovery
line of an immediate supergraph of $G$, then $(c_{j,y}, c_{i,x})$ must also intersect one of the $N$
recovery lines $M^*(G - n_k)$, $0 \le k \le N - 1$.

Proof. Suppose that $(c_{j,y}, c_{i,x})$ does not intersect any $M^*(G - n_k)$. Then, for any
$0 \le k \le N - 1$, either (1) $c_{i,x} \le M^*(G - n_k)[i]$ or (2) $M^*(G - n_k)[j] \le c_{j,y}$. For any
immediate supergraph $G - W$ of $G$, if case (1) is true for all $n_k \in W$, then

$$c_{i,x} \le \min\Big(\bigcup_{n_k \in W} M^*(G - n_k)[i]\Big) = M^*(G - W)[i]$$

by Lemma 7; if case (2) is true for some $n_l \in W$, then

$$M^*(G - W)[j] = \min\Big(\bigcup_{n_k \in W} M^*(G - n_k)[j]\Big) \le M^*(G - n_l)[j] \le c_{j,y}.$$

In either case, $(c_{j,y}, c_{i,x})$ does not intersect $M^*(G - W)$. Therefore, if $(c_{j,y}, c_{i,x})$ intersects
$M^*(G - W)$, then $(c_{j,y}, c_{i,x})$ must intersect $M^*(G - n_k)$ for some $0 \le k \le N - 1$. □

Combining Properties 8 and 9, we now present the necessary and sufficient condition 
for identifying all nongarbage edges. 

THEOREM 4 An edge $(c_{j,y}, c_{i,x})$ in a checkpoint graph $G$ is nongarbage if and only if
$(c_{j,y}, c_{i,x})$ intersects $M^*(G - n_k)$ for some $0 \le k \le N - 1$.

Proof. If $(c_{j,y}, c_{i,x})$ is nongarbage, $(c_{j,y}, c_{i,x})$ must intersect a future recovery line of $G$
by definition. From Property 8, $(c_{j,y}, c_{i,x})$ must intersect the recovery line of an immediate
supergraph of $G$. From Property 9, $(c_{j,y}, c_{i,x})$ must intersect one of the $N$ recovery lines
$M^*(G - n_k)$, $0 \le k \le N - 1$. Conversely, if $(c_{j,y}, c_{i,x})$ intersects any $M^*(G - n_k)$, then
$(c_{j,y}, c_{i,x})$ is nongarbage because $M^*(G - n_k)$ is a possible future recovery line of $G$. □

Theorem 4 also leads to an optimal message log reclamation algorithm for finding
all nongarbage message logs: first compute the $N$ recovery lines $M^*(G - n_k)$, $0 \le k \le N - 1$;
only those message logs with their corresponding edges intersecting any of the
$N$ recovery lines are nongarbage. In Fig. 2.16, the edge $(E, F)$ intersects $M^*(G - n_0)$,
$(G, H)$ intersects $M^*(G - n_4)$ and none of the edges intersects $M^*(G - n_1)$, $M^*(G - n_2)$
or $M^*(G - n_3)$. Therefore, while all of the edges in Fig. 2.16(f) are nonobsolete, only
those message logs corresponding to $(E, F)$ and $(G, H)$ need to be retained.
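
The reclamation rule of Theorem 4 can be sketched in a few lines of Python. The sketch assumes that the $N$ recovery lines have already been computed and that checkpoints are encoded as (process, index) pairs, so that the order between two checkpoints of the same process reduces to comparing indices; the encoding, function names, and sample data are hypothetical.

```python
from typing import List, Tuple

# An edge (c_{j,y}, c_{i,x}) is encoded as ((j, y), (i, x)); a recovery line is a
# list M where M[p] is the checkpoint index of process p (assumed encoding).
Edge = Tuple[Tuple[int, int], Tuple[int, int]]


def intersects(edge: Edge, M: List[int]) -> bool:
    """True if c_{j,y} < M[j] and M[i] < c_{i,x}."""
    (j, y), (i, x) = edge
    return y < M[j] and M[i] < x


def nongarbage_edges(edges: List[Edge], lines: List[List[int]]) -> List[Edge]:
    """Keep the edges that intersect at least one of the given recovery lines."""
    return [e for e in edges if any(intersects(e, M) for M in lines)]


# Hypothetical two-process example with two recovery lines.
lines = [[1, 0], [2, 2]]
edges = [((0, 0), (1, 1)), ((1, 0), (0, 1)), ((0, 1), (1, 3))]
print(nongarbage_edges(edges, lines))
```

Message logs whose edges are not returned by `nongarbage_edges` correspond to garbage edges in this sketch and could be reclaimed.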

Figure 2.17 shows an example for analyzing the algorithm complexity. The rollback
propagation algorithm is applied to the checkpoint graph shown in Fig. 2.17(a), and
(b)-(d) illustrate the steps for finding the recovery line. Since all of the visited edges
can be removed as shown in the figure, the algorithm is of time complexity $O(|E|)$.
The nongarbage edges can be identified as the remaining incoming edges of the marked
checkpoints. Since the additional complexity of scanning through the marked checkpoints
is no greater than $O(|E|)$, our optimal garbage collection algorithm for identifying all
nongarbage edges as well as nongarbage checkpoints is of complexity $O(N|E|)$.

Figure 2.16: Example execution of the optimal garbage collection algorithm.

Figure 2.17: Example for identifying nongarbage checkpoints and edges. (For the checkpoint
graph in (a), (b)-(d) illustrate the step-by-step execution of the rollback
propagation algorithm.)

3. RECOVERY LINE PROGRESSION


Having developed a garbage collection algorithm to minimize the space overhead
for uncoordinated checkpointing, we next address the recovery line progression problem.
Traditionally, uncoordinated checkpointing, coordinated checkpointing, and the log-based
approach have been considered three different approaches, each with its own advantages
and disadvantages as described in Chapter 1. In this chapter, we adopt a different viewpoint
and present a unifying framework for all three approaches. Our framework is based
on uncoordinated checkpointing, because it is the most general approach in terms of
process autonomy and the assumptions about program behavior. The price paid for such
generality is the potential domino effect. We then consider checkpoint coordination and
exploiting piecewise determinism as two mechanisms for eliminating the domino effect
by inserting additional checkpoints. The concept of lazy checkpoint coordination is introduced
to allow sacrificing a varying degree of process autonomy in exchange for the
guarantee of recovery line progression. The notion of logical checkpoints is introduced
to interpret message logging in the piecewise deterministic model as providing additional
checkpoints available for the processes to roll back to, thereby reducing rollback propagation.
In such a unifying framework, all three approaches can be integrated together
and benefit from the optimal garbage collection algorithm of Chapter 2. At the end of
this chapter, we also describe a message reordering approach to reducing rollback propagation
without inserting any extra checkpoints. For systems in which messages can be
reordered without affecting program correctness, such a technique can be incorporated
as an additional mechanism to further advance the recovery lines.

3.1 Lazy Checkpoint Coordination 

3.1.1 Communication-induced checkpoint coordination 

The basic concept of lazy checkpoint coordination is to insert extra induced checkpoints
based on the communication history, in addition to the basic checkpoints initiated
independently by each process, to ensure that a new consistent set of checkpoints will be
formed periodically to advance the recovery line. Figure 3.1(a) illustrates a checkpoint
and communication pattern with the domino effect. A straightforward way of avoiding
such possibly unbounded rollback propagation is to perform traditional eager checkpoint
coordination as shown in Fig. 3.1(b), where $b_{i,x}$ denotes the $x$th basic checkpoint of
$p_i$. Whenever a process takes a basic checkpoint, a coordination message (dotted line)
is broadcast to request the cooperation in making a consistent set of checkpoints [11].
Let $B$ be the total number of basic checkpoints and $I$ be the total number of induced
checkpoints. We define the induction ratio as

$$\mathcal{R} = \frac{I}{B},$$

which is a measure of the overhead for performing communication-induced checkpoint
coordination. Clearly, eager checkpoint coordination always has $\mathcal{R} = N - 1$, and the
$N - 1$ coordination messages per checkpoint session constitute additional overhead.

The large overhead of eager checkpoint coordination results from its pessimistic nature.
More specifically, when $p_1$ in Fig. 3.1(b) initiates its first basic checkpoint $b_{1,1}$, it
“pessimistically” assumes that messages like $m_1$ will exist in the future and cause $b_{1,1}$
to be inconsistent with its corresponding checkpoint $b_{0,1}$ on $p_0$. To guarantee that $b_{1,1}$
belongs to a useful recovery line, $p_1$ “eagerly” requests $p_0$'s cooperation at the time $b_{1,1}$
is initiated. In contrast, lazy checkpoint coordination adopts an optimistic approach by
assuming that $b_{0,1}$ will be consistent with $b_{1,1}$. If the assumption turns out to be true, no
explicit coordination is necessary. An extra checkpoint will be induced on $p_0$ only when
message $m_1$ indicates that the assumption has failed [21] (Fig. 3.1(c)).$^1$ From another
point of view, such a scheme “lazily” delays the broadcast of the coordination messages
and implicitly piggybacks them on future normal messages [20]. Both checkpoint and
message overhead can therefore be reduced.

However, given a basic checkpoint pattern, the number of induced checkpoints in
the above scheme is determined by the communication pattern and is not otherwise
controllable. In the worst case, the induction ratio $\mathcal{R}$ can still be $N - 1$ as illustrated in
Fig. 3.1(c). To further reduce the overhead, we can perform even “lazier” coordination
by enforcing the consistency only between checkpoints $c_{0,nZ}$ and $c_{1,nZ}$ where $Z$ is called
the laziness and $n$ is an integer. Figure 3.1(d) shows the case of $Z = 2$. No checkpoint is
induced until message $m_2$ indicates the inconsistency between $b_{1,2}$ and $b_{0,2}$. The number
of induced checkpoints is reduced from 8 to 2 at the cost of potentially larger rollback
distance. We will show in the next section that, while the worst-case induction ratio for
$Z = 1$ is of order $N$ (the number of processes), the upper bound on the induction ratio
for $Z \ge 2$ is related to the maximum ratio of the lengths of any two basic checkpoint
intervals and is independent of $N$.

$^1$ The motivation for lazy checkpoint coordination is similar to the concepts behind the lazy release
consistency in distributed shared memory [51] and the lazy message cancellation in optimistic distributed
simulation systems [52].

Figure 3.1: Communication-induced checkpointing: (a) checkpoint and communication
pattern; (b) eager checkpoint coordination; lazy checkpoint coordination with
(c) laziness = 1 and (d) laziness = 2.

Lazy checkpoint coordination can be implemented as follows. The laziness $Z$ is a
predetermined system parameter known to all processes. During normal execution, each
process $p_i$ maintains a variable $V$ which is initialized to be $Z$ and incremented by $Z$ each
time $c_{i,nZ}$ is taken. When $p_i$ at its $x$th checkpoint interval is about to process a message
$m$ tagged with the sender $p_j$'s checkpoint interval number $y \ge V$, $p_i$ is forced to take the
checkpoint $c_{i,lZ}$ where $l = \lfloor y/Z \rfloor$. In other words, if $m$ was sent after $c_{j,lZ}$ had been taken,
then it must be processed by $p_i$ after $c_{i,lZ}$ is induced. Notice that the induced checkpoint
$c_{i,lZ}$ can be referred to as any checkpoint $c_{i,w}$ with $x < w \le lZ$. Since our approach is to
incorporate lazy checkpoint coordination into an uncoordinated checkpointing protocol
(which corresponds to $Z = \infty$) as a mechanism for bounding rollback propagation, the
optimal garbage collection algorithm remains useful in reducing the space overhead for
such a domino-free recovery protocol.
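
The receive-side rule above can be sketched as follows. This is a minimal illustration that assumes the trigger condition is $y \ge V$, that interval numbers start at 0, and that the checkpoint itself is represented only by its index; the class and method names are hypothetical and saving state to stable storage is omitted.

```python
class LazyProcess:
    """Bookkeeping for lazy checkpoint coordination at one process p_i."""

    def __init__(self, laziness: int):
        self.Z = laziness
        self.V = laziness      # next multiple of Z whose checkpoint is not yet taken
        self.index = 0         # x: index of the current checkpoint interval
        self.induced = []      # indices of induced checkpoints, kept for inspection

    def take_basic_checkpoint(self) -> None:
        self.index += 1
        if self.index == self.V:      # c_{i,nZ} reached by a basic checkpoint
            self.V += self.Z

    def before_processing(self, sender_interval: int) -> None:
        """Called with the sender's interval number y tagged on message m."""
        y = sender_interval
        if y >= self.V:
            l = y // self.Z
            # Induce c_{i,lZ}; the skipped indices x < w < lZ act as dummy checkpoints.
            self.index = l * self.Z
            self.induced.append(self.index)
            self.V = self.index + self.Z


p = LazyProcess(laziness=2)
p.before_processing(1)     # y = 1 < V = 2: no checkpoint is induced
p.take_basic_checkpoint()  # p is now in interval 1
p.before_processing(5)     # y = 5 >= V: induces checkpoint number 4
print(p.index, p.induced, p.V)
```

With a very large `laziness`, `before_processing` never induces a checkpoint, which matches the remark that uncoordinated checkpointing corresponds to $Z = \infty$.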

3.1.2 Worst-case analysis

Our approach to worst-case analysis consists of two steps. First, given any basic 
checkpoint pattern, we construct the worst-case communication pattern. Second, given 
any system with N processes and laziness Z, we derive the worst-case induction ratio as 
a function of N and Z by considering these worst-case communication patterns. 

For the purpose of presentation, we assume that every checkpoint $c^{\mathcal{P}}_{i,x}$ in a checkpoint
and communication pattern $\mathcal{P}$ is associated with a global time stamp $t(c^{\mathcal{P}}_{i,x})$. For any $n$,
define $c^{\mathcal{P}}_{*,nZ} = c^{\mathcal{P}}_{i,nZ}$ if $t(c^{\mathcal{P}}_{i,nZ}) \le t(c^{\mathcal{P}}_{j,nZ})$ for all $0 \le j \le N - 1$, i.e., $c^{\mathcal{P}}_{*,nZ}$ denotes the earliest
checkpoint #$nZ$ among all processes. Given any basic checkpoint pattern and laziness $Z$,
we construct the communication pattern$^2$ $\mathcal{P}_0$ as follows. If $c^{\mathcal{P}_0}_{*,nZ} = c^{\mathcal{P}_0}_{i,nZ}$, then $p_i$ sends a
message to every other process $p_j$ and induces $c^{\mathcal{P}_0}_{j,nZ}$ with $t(c^{\mathcal{P}_0}_{j,nZ}) \approx t(c^{\mathcal{P}_0}_{i,nZ})$. Figure 3.2(a)
shows an example of $\mathcal{P}_0$ with $Z = 2$. We will call the interval between $t(c^{\mathcal{P}_0}_{*,(n-1)Z})$ and
$t(c^{\mathcal{P}_0}_{*,nZ})$ the induction session #$n$, which includes all of the induced checkpoints $c^{\mathcal{P}_0}_{j,nZ}$.

Since the induction of any checkpoint $c^{\mathcal{P}}_{j,nZ}$ (and hence any possible dummy checkpoints
$c^{\mathcal{P}}_{j,y}$, $(n - 1)Z < y < nZ$) cannot happen until the first checkpoint #$nZ$, e.g., $c^{\mathcal{P}}_{i,nZ}$,
is taken, $p_i$ has to take $Z$ consecutive basic checkpoints by itself to reach $c^{\mathcal{P}}_{i,nZ}$, as stated
in Property 10.

PROPERTY 10 If $c^{\mathcal{P}}_{*,nZ} = c^{\mathcal{P}}_{i,nZ}$, then the $Z$ checkpoints $c^{\mathcal{P}}_{i,x}$, $(n - 1)Z < x \le nZ$, must
be basic checkpoints.

$^2$ When it is clear from the context that the basic checkpoint pattern is fixed, the same notation for
the checkpoint and communication pattern will also be used to refer to the communication pattern.

Figure 3.2: (a) Worst-case communication pattern; (b) worst-case checkpoint and
communication pattern.

We show in the next property that, given a basic checkpoint pattern, $\mathcal{P}_0$ has the
earliest $c_{*,nZ}$ for any positive integer $n$.

PROPERTY 11 Given a basic checkpoint pattern, we have $t(c^{\mathcal{P}_0}_{*,nZ}) \le t(c^{\mathcal{P}}_{*,nZ})$ for an
arbitrary communication pattern $\mathcal{P}$ and any positive integer $n$.

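Proof. The proof is given by induction on $n$. Since there cannot be any induced
checkpoint before $t(c^{\mathcal{P}}_{*,Z})$ for any $\mathcal{P}$, $t(c^{\mathcal{P}}_{*,Z})$ depends only on the progress of taking basic
checkpoints. Therefore, $t(c^{\mathcal{P}_0}_{*,Z}) = t(c^{\mathcal{P}}_{*,Z})$ and the case $n = 1$ is true. For the case $n = k$,
suppose $c^{\mathcal{P}}_{*,kZ} = c^{\mathcal{P}}_{i,kZ}$. All of the $Z$ checkpoints $c^{\mathcal{P}}_{i,l}$ with $(k - 1)Z < l \le kZ$ must be
basic checkpoints by Property 10. Also,
$t(c^{\mathcal{P}}_{*,(k-1)Z}) \le t(c^{\mathcal{P}}_{i,(k-1)Z}) \le t(c^{\mathcal{P}}_{i,l}) \le t(c^{\mathcal{P}}_{i,kZ})$ by
definition. Suppose that the case $n = k - 1$ is true, i.e., $t(c^{\mathcal{P}_0}_{*,(k-1)Z}) \le t(c^{\mathcal{P}}_{*,(k-1)Z})$. We
then have $c^{\mathcal{P}}_{i,kZ} = c^{\mathcal{P}_0}_{i,q}$ where $q \ge kZ$ because $t(c^{\mathcal{P}_0}_{i,(k-1)Z}) \approx t(c^{\mathcal{P}_0}_{*,(k-1)Z})$ by construction
and there are at least $Z$ basic checkpoints of $p_i$, i.e., the $c^{\mathcal{P}}_{i,l}$'s, between $t(c^{\mathcal{P}_0}_{*,(k-1)Z})$ and
$t(c^{\mathcal{P}}_{i,kZ})$. Finally,

$$t(c^{\mathcal{P}_0}_{*,kZ}) \le t(c^{\mathcal{P}_0}_{i,kZ}) \le t(c^{\mathcal{P}_0}_{i,q}) = t(c^{\mathcal{P}}_{i,kZ}) = t(c^{\mathcal{P}}_{*,kZ})$$

and we have proved $t(c^{\mathcal{P}_0}_{*,nZ}) \le t(c^{\mathcal{P}}_{*,nZ})$ for all positive integers $n$. □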


It follows from Property 11 that $\mathcal{P}_0$ must possess the largest number of $c_{*,nZ}$'s. Since
each $c_{*,nZ}$ in $\mathcal{P}_0$ also induces the largest possible number ($N - 1$) of induced checkpoints,
the total number of induced checkpoints in $\mathcal{P}_0$ must be the largest and hence we have
the following property.

PROPERTY 12 Given a basic checkpoint pattern, $\mathcal{P}_0$ is the worst-case communication
pattern resulting in the largest induction ratio.

Property 12 states that, for the worst-case analysis of induction ratio, we have to
consider only the communication pattern $\mathcal{P}_0$ for each basic checkpoint pattern. Since
every $\mathcal{P}_0$ has well-defined induction sessions as shown in Fig. 3.2, the derivations can be
greatly simplified.

From Property 10, at least $Z$ basic checkpoints are needed to induce at most $N - 1$
checkpoints and thus we have an upper bound on the induction ratio

$$\mathcal{R} \le \frac{N - 1}{Z}. \qquad (3.1)$$

It is also the worst-case induction ratio achievable by some $\mathcal{P}_0$, for which an example with
$Z = 2$ and $N = 3$ is shown in Fig. 3.2(b). (The stacked checkpoints indicate that each
dummy checkpoint $c_{i,2n-1}$ is exactly the induced checkpoint $c_{i,2n}$.)

The upper bound in Eq. (3.1) was derived under no constraints on the checkpoint and
communication pattern. Since it is $O(N)$, the induction ratio may be unacceptably high
for systems with a large number of processes. However, a closer look at the two patterns
in Fig. 3.2 reveals that the situation in (b) which results in the worst-case induction ratio
is less likely to happen for applications in which the basic checkpoint intervals typically
do not vary too much. For example in (b), it is very likely for $p_0$ to take at least one basic
checkpoint between f(c^) and t(c^g). We will show that under the following constraints, 
which are usually satisfied in many applications, the upper bound on the induction ratio
is independent of $N$ for $Z \ge 2$. (For the case of $Z = 1$, Fig. 3.1(c) demonstrates that
the worst-case induction ratio of $(N - 1)/Z = N - 1$ is always achievable and cannot be
reduced.)

Constraint-1: Let $Q$ denote the maximum ratio of lengths of any two basic checkpoint
intervals. Although each process is allowed to take its basic checkpoints at its own
pace, $Q$ is typically bounded by a small constant. (For example, this constant is 2 or 3 for
our experiments described in the next section.) Therefore, $Q = O(1)$.

Constraint-2: Let $L$ be the number of complete induction sessions in $\mathcal{P}_0$. The applications
employing checkpointing and rollback recovery are usually long-running programs,
which implies $Z \cdot L$ is quite large. In particular, we assume that $Z \cdot L \gg \lceil Q \rceil$.

From Property 10, each induction session must contain $Z$ consecutive basic checkpoints
and hence at least $Z - 1$ basic checkpoint intervals at some process. Let $S$ denote
the following set of integers:

$$S = \{m : m \cdot (Z - 1) \ge Q \text{ and } m \le \lceil Q \rceil\}.$$

For $Z \ge 2$, $S$ contains at least one element, namely, $\lceil Q \rceil$. Let $M$ be the minimum
element of $S$. We define an $M$-session as consisting of $M$ consecutive induction sessions,
session #$((n - 1)M + 1)$ through session #$nM$. Our approach is based on the observation
that within an $M$-session, every process either takes at least one set of $Z$ consecutive
basic checkpoints which defines one of the induction sessions, or takes at least one basic
checkpoint due to Constraint-1. Since, within an $M$-session, the number of induced
checkpoints is $M \cdot (N - 1)$ where $M \le \lceil Q \rceil = O(1)$ and the number of basic checkpoints
is at least $N$, the upper bound on the induction ratio is independent of $N$.

THEOREM 5 Under the above two constraints, the induction ratio $\mathcal{R} \le \lceil Q \rceil$ for laziness
$Z \ge 2$, where $Q$ is the maximum ratio of lengths of any two basic checkpoint intervals.

Proof. Again we have to consider only $\mathcal{P}_0$ for each basic checkpoint pattern. There
are $L_M = \lfloor L/M \rfloor$ complete $M$-sessions, each containing $M \cdot (N - 1)$ induced checkpoints.
We distinguish the following two cases:

(a) $N \le M$: From Eq. (3.1), $\mathcal{R} \le \frac{N - 1}{Z} < N \le M \le \lceil Q \rceil$.

(b) $N > M$: First we consider the number of induced checkpoints $I$. If $Z \ge Q + 1$, then
$M = 1$ and $I = L \cdot (N - 1)$. If $Z < Q + 1$, then $Z \cdot L \gg \lceil Q \rceil$ in Constraint-2 implies
$L \gg \lceil Q \rceil$. Since $M \le \lceil Q \rceil$, we have $L/M \gg 1$; thus $L_M > 1$ and $I \approx L_M \cdot M \cdot (N - 1)$.
In either case, $I \approx L_M \cdot M \cdot (N - 1)$.

Now consider the number of basic checkpoints $B$. For each induction session #$n$,
the process $p_i$ with $c^{\mathcal{P}_0}_{i,nZ} = c^{\mathcal{P}_0}_{*,nZ}$ must contribute $Z$ basic checkpoints. Therefore, the
length of each induction session is at least $Z - 1$ basic checkpoint intervals. Within each
$M$-session, at least $N - M$ processes do not contain $c^{\mathcal{P}_0}_{*,nZ}$ for any $n$. By the definition
of $Q$, these $N - M$ processes must each contribute at least $\lfloor M \cdot (Z - 1)/Q \rfloor$ basic checkpoints.
Therefore,

$$B \ge L_M \cdot \left( M \cdot Z + (N - M) \cdot \left\lfloor \frac{M \cdot (Z - 1)}{Q} \right\rfloor \right)$$

and

$$\mathcal{R} = \frac{I}{B} \le \frac{M \cdot (N - 1)}{M \cdot Z + (N - M) \cdot \left\lfloor \frac{M \cdot (Z - 1)}{Q} \right\rfloor}. \qquad (3.2)$$

Since $Z > 1$ and $M \ge 1$ by definition, and $\lfloor M \cdot (Z - 1)/Q \rfloor \ge 1$ because $M \in S$, we have

$$\mathcal{R} \le \frac{M \cdot (N - 1)}{M + (N - M)} < M \le \lceil Q \rceil \qquad (3.3)$$

as required. □

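Combining Eq. (3.1) (for $Z = 1$ and Case (a)) and Eq. (3.2), we define the refined
upper bound on the induction ratio $\mathcal{R}$, called the $Q$-bound, as follows:

$$Q\text{-bound} = \frac{M \cdot (N - 1)}{M \cdot Z + [N > M] \cdot \left( (N - M) \cdot \left\lfloor \frac{M \cdot (Z - 1)}{Q} \right\rfloor \right)},$$

where $[N > M] = 1$ if $N > M$ is true and 0 otherwise.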
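
To make the two bounds concrete, the following Python sketch evaluates Eq. (3.1) and the $Q$-bound for given $N$, $Z$, and $Q$. It assumes $M$ is the smallest integer $m$ with $m \cdot (Z - 1) \ge Q$ and $m \le \lceil Q \rceil$, and it falls back to Eq. (3.1) for $Z = 1$; the function names and the sample values are illustrative and are not taken from the experiments.

```python
import math


def session_length(Z: int, Q: float) -> int:
    """Smallest m with m*(Z-1) >= Q and m <= ceil(Q); defined only for Z >= 2."""
    if Z < 2:
        raise ValueError("M is only defined for Z >= 2")
    for m in range(1, math.ceil(Q) + 1):
        if m * (Z - 1) >= Q:
            return m
    raise ValueError("no valid M; Q is expected to be at least 1")


def worst_case_bound(N: int, Z: int) -> float:
    """Upper bound of Eq. (3.1)."""
    return (N - 1) / Z


def q_bound(N: int, Z: int, Q: float) -> float:
    """Refined upper bound (Q-bound) on the induction ratio."""
    if Z == 1:
        return worst_case_bound(N, Z)
    M = session_length(Z, Q)
    # The [N > M] indicator switches the second denominator term on or off.
    extra = (N - M) * math.floor(M * (Z - 1) / Q) if N > M else 0
    return M * (N - 1) / (M * Z + extra)


# Hypothetical values: N = 8 processes, Q = 2.
for Z in (1, 2, 3):
    print(Z, worst_case_bound(8, Z), q_bound(8, Z, 2))
```

For $N \le M$ the second term of the denominator vanishes and the sketch reduces to $(N - 1)/Z$, in agreement with Case (a).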

3.1.3 Experimental results

Table 3.1 gives the parameters of the four Chare Kernel programs used in the trace-driven
simulation with lazy checkpoint coordination. The predetermined minimum basic
checkpoint interval is chosen to be 120 sec. A variable Next_CP_Time is initialized to 120
sec. Each process checks its local clock after processing every 100 messages. If the clock
time exceeds Next_CP_Time, then a basic checkpoint is inserted, and Next_CP_Time is
incremented by 120 sec. The resulting average basic checkpoint interval (CPI) for each
program is listed in Table 3.1. Before processing a new message, each process also checks
if it has to take an induced checkpoint, as described in Section 3.1. All reported numbers
are averaged over five runs.
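
The basic checkpoint schedule described above can be sketched as a small counter-and-clock check. The class name, the method name, and the choice to pass the clock value in as an argument are assumptions made for illustration; the simulator's actual code is not shown in the text.

```python
class BasicCheckpointTimer:
    """Decides when a process should insert a basic checkpoint."""

    def __init__(self, interval_sec: float = 120.0, check_every: int = 100):
        self.interval = interval_sec
        self.check_every = check_every
        self.next_cp_time = interval_sec   # plays the role of Next_CP_Time
        self.processed = 0

    def on_message_processed(self, clock_sec: float) -> bool:
        """Return True if a basic checkpoint should be inserted now."""
        self.processed += 1
        if self.processed % self.check_every != 0:
            return False                   # the clock is consulted every check_every messages
        if clock_sec > self.next_cp_time:
            self.next_cp_time += self.interval
            return True
        return False


timer = BasicCheckpointTimer(interval_sec=120.0, check_every=2)
decisions = [timer.on_message_processed(t) for t in (130, 130, 200, 200, 250, 250)]
print(decisions)
```

Because the clock is only consulted after a batch of messages, the realized interval between basic checkpoints in such a scheme may exceed the nominal 120 sec.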

Table 3.1: Execution and checkpoint parameters of the parallel programs.

Programs                          Test         Logic        Knight     N-Queen
                                  Generation   Synthesis    Tour
Number of processors              8            6            8          6
Execution time (sec)              2,076        1,736        2,436      1,567
Number of messages                28,219       411,733      104,170    25,880
Average number of basic
  checkpoints per processor       12.6         11.8         18.0       10.5
Average basic CPI (sec)           158          140          132        139
Q                                 2.17         2.48         1.42       1.55
Under-2 percentage                99.6%        97.0%        100%       100%

We expect the variation of the basic checkpoint interval to be small because of the
way it is maintained. In particular, we choose $Q = 2$ to estimate the induction ratio. The
exact value of $Q$ for each program is listed in Table 3.1. Although $Q$ is slightly greater
than 2 for the first two programs, the numbers listed in the row of “Under-2 percentage”
show that a very high percentage of the basic checkpoint intervals are covered by $Q = 2$,
which thus serves as a good approximation. Figure 3.3 plots the $Q$-bounds against the
worst-case ratios computed from Eq. (3.1) and the actual induction ratios (the “Result”
curve) obtained from the trace-driven simulation for the four programs. It demonstrates
that the $Q$-bound provides a good estimate of the induction ratio. The large difference
in the ratio between $Z = 1$ and $Z \ge 2$ confirms that our generalization of the idea
of communication-induced checkpoint coordination as described in [21] can significantly
reduce the extra checkpoint overhead.

Figure 3.4 plots the average rollback distances in terms of the number of average
basic CPIs for the four programs. We use 0.5 for $Z = 1$ and $(Z - 1)/2$ for $Z \ge 2$ in
the “Estimated” curve. Figures 3.3 and 3.4 illustrate that lazy checkpoint coordination
provides a flexible trade-off between coordination overhead and recovery efficiency.

Figure 3.3: Checkpoint coordination overhead (induction ratio) as a function of laziness ($Z$):
(a) Test Generation; (b) Logic Synthesis; (c) Knight Tour; (d) N-Queen.

Figure 3.4: Average rollback distance as a function of laziness.

3.2 Exploiting Piecewise Determinism

As described in Section 1.2, when $p_j$ in Fig. 3.5 rolls back to $c_{j,y}$, $p_i$ is also required to
roll back to undo the effect of message $m$ because the potential nondeterministic events
preceding the sending of $m$, for example, the event $e_1$, may invalidate $m$ after the rollback.
However, if the event $e_1$ can be detected and recorded, it can then be replayed after
the rollback to reconstruct the state from which message $m$ was originally sent (under
the fail-stop assumption). Since the exact copy of $m$ is guaranteed to be resent during
reexecution, the state of $p_i$, which depends on the processing of $m$, remains valid and does
not have to be rolled back. Motivated by the above observation, the log-based approach
[26] assumes the piecewise deterministic (PWD) model in which the process execution is
viewed as consisting of a number of state intervals with completely deterministic execution,
each started by a detectable and recordable nondeterministic event, for example,
processing a new message. It has been shown that under the PWD assumption, additional
event logging allows deterministic state reconstruction to effectively eliminate the
domino effect [26,33]. Because the concept of state consistency in the PWD model is
fundamentally different from the one described in Section 1.2, the log-based approach
has traditionally been presented using a different dependency model (as described in the
next chapter).

We first present a unified dependency model for both PWD and non-PWD scenarios.
Our approach is to introduce the notion of logical checkpoints which allows the happened
before dependency model to be applied to systems with piecewise determinism as well.
Suppose $e_1$ and $e_2$ are the only nondeterministic events of $p_j$ in Fig. 3.5. We call $c_{j,y}$ a
physical checkpoint which allows the process state to be restored by simply loading the
saved state on stable storage back to memory. With the PWD assumption, restarting
the execution from $c_{j,y}$ also guarantees that the state up to the point immediately before
the next nondeterministic event $e_1$ can be reconstructed, effectively placing a logical
checkpoint $L_1$ before $e_1$. In addition, if event $e_1$ is logged and can be replayed, then
$p_j$'s capability of state reconstruction equivalently places another logical checkpoint
immediately before $e_2$. Therefore, while $p_j$ physically rolls back to $c_{j,y}$, it logically rolls
back to $L_2$ and does not unsend $m$. In other words, while $c_{j,y}$ and $c_{i,x+1}$ are inconsistent,
$L_2$ and $c_{i,x+1}$ are not ordered by the happened before relation and thus form a consistent
set of checkpoints.

Figure 3.5: Nondeterministic events and logical checkpoints.

It becomes clear that the PWD assumption allows the use of event logging to increase 
the number of available (logical) checkpoints, thereby reducing rollback propagation and 
avoiding domino effects. However, the assumption that every nondeterministic event in 
the entire execution is detectable, recordable and replayable may not be valid for many 
applications [53,54]. For example, replaying real-time clock values, sensor readings or
external resource status may not be meaningful. Therefore, instead of viewing the PWD 
assumption as a restriction imposed upon the program behavior, we consider exploiting 
the PWD model as a mechanism for reducing rollback propagation. More specifically, 
we use uncoordinated checkpointing as the basic scheme and exploit the PWD model 
whenever it is valid in order to effectively advance the recovery line. 

Figure 3.6 gives an example. In Fig. 3.6(a), no piecewise determinism is exploited
and the domino effect forces the global recovery line to stay at the very beginning of process
execution. If piecewise determinism can be fully exploited, as shown in Fig. 3.6(b),
then we have an additional logical checkpoint before every nondeterministic event.
Figure 3.6(c) shows a situation in which the PWD assumption is not valid for the parts of
the execution indicated by the shaded bars, and hence the logical checkpoints in those
regions are unavailable. Note that once a process turns off the PWD mode, it cannot resume
PWD execution until the next physical checkpoint because the state reconstruction
process for the current checkpoint interval has been interrupted.

Figure 3.7(a) gives the corresponding (logical) checkpoint graph and (b)-(f) show
the $N$ recovery lines produced by the PCSR algorithm of Section 2.1.4. We note that
since the logical checkpoint $L_0$ in (b) is nongarbage, the physical checkpoint $c_0$ and the
message log of $m_0$ (Fig. 3.6(c)) are nongarbage, and $m_0$ must be the first new message to
be processed after $p_0$ restarts from $c_0$. In contrast, the two dark edges in (f) are identified
as nongarbage edges by the algorithm of Section 2.3, which means that the contents of the
message logs of $m_1$ and $m_2$ are nongarbage but the original order of processing can be
discarded.

Figure 3.6: Piecewise determinism and the availability of logical checkpoints. (The
shaded bars indicate those parts of process execution which do not satisfy
the PWD assumption.)

Figure 3.7: Optimal garbage collection for partially exploited piecewise deterministic
model. (For the checkpoint graph in (a), (b)-(f) are the $N$ immediate supergraphs
used in the PCSR algorithm.)

3.3 Scheduling Message Processing 
3.3.1 Message prioritization 

Most checkpointing and recovery techniques have assumed that the communication
pattern is determined by program behavior and is not otherwise controllable; hence,
the only way of reducing rollback propagation is to change the checkpoint pattern by
inserting additional checkpoints. However, in a message-driven system such as the Chare
Kernel [55], the communication pattern can often be determined by the run-time support
system in a user-transparent way as well as by program behavior. Since the order in which
the messages arrive at the receiver cannot be assumed, changing the order of message
processing will typically not affect program correctness. We next describe how messages
can be prioritized according to the checkpointing and communication history in order
to control the communication pattern to reduce rollback propagation [56]. Essentially,
our message scheduling algorithm is based on the following two concepts: dependency-redundant
messages and message aging.

A message m is a dependency-redundant message if its immediate processing will 
result in a dependency that has already been implied in the transitive closure of the 
current checkpoint graph. Since the recovery line is determined by the transitive closure, 
processing a dependency-redundant message will not cause any additional rollback 
propagation. For example, suppose messages $m_0$ and $m_1$ in Fig. 3.8(a) have been 
processed. Then message $m_2$ becomes a dependency-redundant message because its 
corresponding dependency $A < E$ is implied through $A < B < C < D < E$, as shown in (b). 
Similarly, $m_3$ is a dependency-redundant message because $B < F$ is implied through 
$B < C < F$. 
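
The redundancy test described above amounts to a reachability query on the checkpoint 
graph. The following Python fragment is a minimal sketch of that idea, not the thesis 
implementation: the function name `implied` and the edge list are assumptions, and the 
edges are only those that can be read off the two chains quoted in the text 
(A < B < C < D < E and B < C < F), since the full graph of Fig. 3.8(b) is not reproduced here. 

```python
def implied(edges, src, dst):
    """Return True if dst is reachable from src, i.e. src < dst already holds
    in the transitive closure of the checkpoint graph given by `edges`."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, []):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Assumed edge list, taken from the chains A<B<C<D<E and B<C<F in the text.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("C", "F")]

m2_is_redundant = implied(edges, "A", "E")  # dependency that m_2 would add
m3_is_redundant = implied(edges, "B", "F")  # dependency that m_3 would add
```

A message whose dependency is not yet implied (for instance a hypothetical D < F in this 
edge list) would fail the test and would therefore stay in its sender's subqueue. 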


Figure 3.8: Dependency-redundant messages. (a) Checkpoint and communication 
pattern; (b) checkpoint graph. 

The concept of message aging is motivated by the communication-induced check- 
pointing schemes [7,57,58], in which a checkpoint is inserted immediately after every 
message is sent. Since such schemes guarantee that the rollback of any process will not 
unsend any messages, rollback propagation and hence the domino effect are completely 
eliminated. However, experimental results have shown that the major disadvantages of 
such schemes are the uncontrollability of the checkpoint frequency [59] and the possibly 
excessive number of induced checkpoints [60]. Instead of forcing the message senders to 
take additional checkpoints to ensure that every message sent is processed after its sender 
takes the next checkpoint, message aging encourages the receivers to delay the processing 
of each message m until its sender passes the next checkpoint. (Such a message m will 
be called an aged message.) The advantage is that the number of checkpoints and the 
checkpoint frequency can be independent of the communication patterns; the potential 
disadvantage is that either the delayed processing might result in run-time overhead, or 
some processes may be forced to process nonaged messages and hence the system would 
no longer be free of rollback propagation. 

3.3.2 Implementation 

For the receiver to detect aged messages, an additional piece of information has to be 
piggybacked on each message: the time to the next checkpoint of the sender when the 
message is sent. The receiver can then properly manage its message queue based on this 
information. 

Instead of keeping messages from different processes in the same queue, each process 
maintains an array of subqueues, one for each process, and a highest-priority safe queue 
for holding dependency-redundant messages and aged messages. Three additional data 
structures are needed for proper queue management: 

1. Last_Update_Time records the time at which the most recent update of the 
time-to-next-checkpoint information was completed. It is needed for the aging operation 
described later. 

2. Last_Known_CP_Num[N] is an array, with one entry for each process, recording the 
most recent checkpoint interval number of every process that is known to the local 
process based on the communication history. 

3. Last_Processed_CP_Num[N] is an array recording the highest checkpoint interval 
number of the processed messages from each process. It is used for identifying 
dependency-redundant messages. 


The updates of the time-to-next-checkpoint information and priorities take place only 
when a new message arrives (enqueueing) or when the process is about to process the next 
message (dequeueing). The operations performed on the message queue for enqueueing 
and dequeueing are outlined in Figs. 3.9 and 3.10, respectively. The aging operation 
decreases the time-to-next-checkpoint information of the last message in each nonempty 
subqueue by the difference between the current time and Last_Update_Time. 
If the time to the next checkpoint of a message becomes negative, all of the messages in 
the same subqueue are moved to the safe queue. 


/* message m from the ith checkpoint interval of p_s arrives at the message 
   queue Q on p_r */ 
perform aging operation on Q; 
if (i < Last_Known_CP_Num[p_s]) 
    add m to the safe queue;              /* aged message */ 
else { 
    if (i > Last_Known_CP_Num[p_s]) 
        Last_Known_CP_Num[p_s] = i; 
    if (i <= Last_Processed_CP_Num[p_s]) 
        add m to the safe queue;          /* dependency-redundant message */ 
    else 
        add m to subqueue[p_s]; 
} 

Figure 3.9: Operations for enqueueing. 
/* p_r is about to choose a message from queue Q */ 
perform aging operation on Q; 
if (safe queue is nonempty) 
    choose a message from the safe queue; 
else { 
    choose the message m with the smallest time-to-next-checkpoint; 
    move the remaining messages in the same subqueue to the safe queue; 
    /* if m is from the ith checkpoint interval of p_s */ 
    Last_Processed_CP_Num[p_s] = i; 
} 


Figure 3.10: Operations for dequeueing. 
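
The enqueueing, dequeueing, and aging operations can be combined into one small class. 
The following Python sketch is an illustration under stated assumptions rather than the 
Chare Kernel implementation: the class name `PrimpsQueue`, the explicit `now` argument 
(used instead of reading a clock), the initial values 0 and -1 for the two arrays, and 
the use of "less than or equal" in the dependency-redundancy test are all assumptions. 
Because only the last message of a subqueue carries time-to-next-checkpoint information, 
the sketch dequeues that last message from the chosen subqueue. 

```python
from collections import deque


class PrimpsQueue:
    """Sketch of the per-process message queue of Section 3.3.2 (assumed layout)."""

    def __init__(self, num_procs, now=0.0):
        self.sub = [[] for _ in range(num_procs)]   # per-sender (interval, payload)
        self.ttnc = [0.0] * num_procs               # time to next checkpoint of the
                                                    # last message in each subqueue
        self.safe = deque()                         # highest-priority safe queue
        self.last_update_time = now
        self.last_known_cp_num = [0] * num_procs
        self.last_processed_cp_num = [-1] * num_procs

    def _age(self, now):
        elapsed = now - self.last_update_time
        for p, queue in enumerate(self.sub):
            if queue:
                self.ttnc[p] -= elapsed
                if self.ttnc[p] < 0:                # sender has passed its checkpoint
                    self.safe.extend(payload for _, payload in queue)
                    queue.clear()
        self.last_update_time = now

    def enqueue(self, now, sender, interval, time_to_next_ckpt, payload):
        self._age(now)
        if interval < self.last_known_cp_num[sender]:
            self.safe.append(payload)               # aged message
            return
        self.last_known_cp_num[sender] = interval
        if interval <= self.last_processed_cp_num[sender]:
            self.safe.append(payload)               # dependency-redundant message
        else:
            self.sub[sender].append((interval, payload))
            self.ttnc[sender] = time_to_next_ckpt

    def dequeue(self, now):
        self._age(now)
        if self.safe:
            return self.safe.popleft()
        candidates = [p for p, queue in enumerate(self.sub) if queue]
        if not candidates:
            return None
        p = min(candidates, key=lambda s: self.ttnc[s])
        interval, payload = self.sub[p].pop()
        self.safe.extend(rest for _, rest in self.sub[p])
        self.sub[p].clear()
        self.last_processed_cp_num[p] = interval
        return payload


q = PrimpsQueue(num_procs=3)
q.enqueue(0.0, sender=1, interval=0, time_to_next_ckpt=5.0, payload="a")
q.enqueue(1.0, sender=2, interval=0, time_to_next_ckpt=2.0, payload="b")
first = q.dequeue(2.0)    # sender 2 is closest to its next checkpoint
q.enqueue(2.0, sender=2, interval=0, time_to_next_ckpt=1.0, payload="c")
second = q.dequeue(2.0)   # "c" is dependency-redundant, so it was placed in the safe queue
third = q.dequeue(6.0)    # by now the message from sender 1 has aged
```

In this usage example the message from sender 1 is delayed until its time to the next 
checkpoint has elapsed, which is the behavior that message aging is meant to encourage. 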

3.3.3 Experimental results 

Table 3.2 gives the execution parameters of the four Chare Kernel programs used to 
obtain the experimental results for the message scheduling algorithm. The checkpoint 
interval is chosen to be 25 sec. An offset of 1 sec between the corresponding checkpoints 
of processes $p_i$ and $p_{i+1}$ ($0 \le i < N - 1$) is introduced to study the effect of 
checkpoint asynchrony on the rollback distances. 


Table 3.2: Execution parameters of the parallel programs. 

Chare Kernel programs    Matrix            Circuit       Knight    N 
                         Multiplication    Extraction    Tour      Queen 
Number of processors     4                 4             6         (illegible) 
Number of messages       1216              1315          13118     1622 


Figure 3.11 compares the performance of three different message scheduling 
algorithms: Last-In-First-Out (LIFO), First-In-First-Out (FIFO) and our PRIoritized 
Message Process Scheduling (PRIMPS) algorithm. The percentage numbers indicate the 
performance degradation of PRIMPS with respect to FIFO. Figure 3.12 compares the 
average rollback distances in terms of the number of checkpoint intervals for the three 
algorithms. Figures 3.11 and 3.12 show that our message scheduling algorithm can 
effectively reduce average rollback distances with little performance degradation. 
Figure 3.13 illustrates the sensitivity of the three algorithms to the degree of 
checkpoint asynchrony by varying the offset between corresponding checkpoints for the 
N-Queen program. It shows that our scheduling algorithm has the additional advantage of 
being much less sensitive to checkpoint asynchrony than are LIFO and FIFO. 

Figure 3.11: Execution times and performance degradation of the message scheduling 
algorithm. (Bar chart: execution time in seconds under LIFO, FIFO, and PRIMPS for 
Matrix Multiplication, Circuit Extraction, Knight Tour, and N Queen.) 
Figure 3.12: Average rollback distances. (Bar chart: number of checkpoint intervals for 
Matrix Multiplication, Circuit Extraction, Knight Tour, and N Queen.) 


Figure 3.13: Sensitivity of average rollback distances to checkpoint asynchrony for the 
N-Queen program. (Vertical axis: number of checkpoint intervals; horizontal axis: 
offset per processor, 0 to 3 sec.) 
4. RELATED WORK 


4.1 Checkpoint Dependency and Interval Dependency 

Johnson and Zwaenepoel [33] derived a lattice model for reasoning about recovery 
in message-passing systems under the assumption of piecewise determinism (PWD). Let 
$(i, x)$ denote the $x$th state interval of process $p_i$; they define a dependency 
relation on the state intervals as follows: $(i, x)$ directly depends on $(j, y)$ if 

• $i = j$ and $x = y + 1$; or 

• $(i, x)$ is started by a message sent from $(j, y)$. 

The transitive closure of the above relation gives the complete state interval dependency. 
A system state consists of N state intervals,¹ one from each process. A consistent system 
state is a system state in which no two constituent state intervals $(i, x)$ and $(j, y)$ 
have $(i, x)$ depending on $(j, y + 1)$ or $(j, y)$ depending on $(i, x + 1)$. A state 
interval becomes stable, i.e., recreatable, when all of the messages processed since its 
immediate previous physical checkpoint, called its effective checkpoint, have been logged. 
A consistent system state in which all constituent state intervals are stable is called a 
recoverable system state. The recoverable system state with each of its constituent state 
intervals as advanced as possible is called the maximum recoverable system state. Johnson 
and Zwaenepoel have derived the lattice model and proved the uniqueness of the maximum 
recoverable system state by using the following approach: 

Step 1: the set of system states forms a lattice $\mathcal{S}$; 

Step 2: the set of consistent system states forms a sublattice $\mathcal{C}$ of $\mathcal{S}$; 

Step 3: the set of recoverable system states forms a sublattice $\mathcal{R}$ of $\mathcal{C}$; 

Step 4: the maximum recoverable system state is the unique maximum in the lattice $\mathcal{R}$. 

¹ Johnson and Zwaenepoel originally defined a system state to be an N x N matrix of which 
the rows are the transitive dependency vectors. Here we adopt the simplification suggested 
by Sistla and Welch [31]. 

As described in the previous chapter, by representing every state interval as a logical 
checkpoint at the end of that interval, the same dependency definition based on the 
happened before relations among checkpoints as used in a non-PWD scenario can still 
be applied to the logical checkpoints. By referring to the logical checkpoints 
corresponding to the stable state intervals as stable logical checkpoints and the global 
checkpoints containing only stable logical checkpoints as stable global checkpoints, the 
approach of Johnson and Zwaenepoel can be translated into: 

Step 1: the set of global checkpoints forms a lattice $\mathcal{S}$; 

Step 2: the set of consistent global checkpoints forms a sublattice $\mathcal{C}$ of $\mathcal{S}$; 

Step 3: the set of stable consistent global checkpoints forms a sublattice $\mathcal{R}$ of $\mathcal{C}$; 

Step 4: the recovery line is the unique maximum in the lattice $\mathcal{R}$. 
As a comparison, our derivation of an alternative lattice model as described in 
Section 1.4 has followed different steps. 

Step 1: the set of logical checkpoints forms a poset P; 

Step 2: the set of stable logical checkpoints forms an induced subposet R of P; 

Step 3: the set of stable consistent global checkpoints is equivalent to the set M(R) of 
maximum-sized antichains of R and thus forms a lattice; 

Step 4: the recovery line is the unique maximal maximum-sized antichain in the lattice 
M(R). 
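
Steps 1 to 4 can be tried out on a small poset by brute force. The Python fragment below 
is a sketch under several assumptions that do not come from the thesis: checkpoints are 
encoded as (process, index) pairs, every checkpoint is treated as stable, each message 
contributes a single ordering edge between two checkpoints, and the two-process example 
is invented. It enumerates the maximum-sized antichains (one checkpoint per process, no 
two ordered) and takes their componentwise maximum as the recovery line, relying on the 
lattice property stated in Step 3. 

```python
from itertools import product


def transitive_closure(nodes, edges):
    """Return the set of pairs (a, b) such that a < b in the closure of `edges`."""
    successors = {n: set() for n in nodes}
    for a, b in edges:
        successors[a].add(b)
    closure = set()
    for start in nodes:
        stack, seen = list(successors[start]), set()
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(successors[n])
        closure.update((start, n) for n in seen)
    return closure


def consistent_global_checkpoints(ckpt_counts, message_edges):
    """Enumerate the maximum-sized antichains: one checkpoint per process,
    with no two chosen checkpoints ordered by the closure."""
    nodes = [(p, x) for p, n in enumerate(ckpt_counts) for x in range(n)]
    edges = [((p, x), (p, x + 1))
             for p, n in enumerate(ckpt_counts) for x in range(n - 1)]
    less = transitive_closure(nodes, edges + list(message_edges))
    cuts = []
    for choice in product(*(range(n) for n in ckpt_counts)):
        members = list(enumerate(choice))
        if all((a, b) not in less for a in members for b in members):
            cuts.append(choice)
    return cuts


def recovery_line(cuts):
    """Componentwise maximum of all consistent global checkpoints."""
    return tuple(max(c[p] for c in cuts) for p in range(len(cuts[0])))


# Invented example: p0 has checkpoints 0..1, p1 has checkpoints 0..2.
counts = [2, 3]
message_edges = [((0, 1), (1, 2)),   # c_{0,1} < c_{1,2}
                 ((1, 0), (0, 1))]   # c_{1,0} < c_{0,1}
cuts = consistent_global_checkpoints(counts, message_edges)
line = recovery_line(cuts)
```

In this example the latest checkpoint of p1 is ordered after the latest checkpoint of p0, 
so it cannot appear in any antichain together with a checkpoint of p0 and is excluded 
from the recovery line. 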

We believe our maximum-sized antichain model has several advantages. First, there is 
a strong intuitive connection between the antichains of the poset $R_{\mathcal{P}}$ based 
on checkpoint dependencies, and the concept of concurrency and therefore consistency. A 
consistent global checkpoint must consist of local checkpoints that could have happened 
simultaneously, and this is precisely captured by the requirement that these local 
checkpoints be unordered by happened before. Johnson and Zwaenepoel’s lattice of system 
states, while perfectly adequate from a formal standpoint, lacks this intuitive force. 
Furthermore, modelling consistent global checkpoints as maximum-sized antichains enables 
the use of well-known properties of posets to derive many important results. For example, 
these properties have played a very important role in our development of the optimal 
garbage collection algorithm. As another example, our demonstration of the existence 
of a lattice structure among consistent global checkpoints, and hence the uniqueness of 
the recovery line, follows from general theorems about antichains in posets, as contrasted 
with Johnson and Zwaenepoel’s more “low-level” development. 

4.2 Checkpoint Graphs and Local System Graphs 

Bhargava and Lian [5] consider the same uncoordinated checkpointing protocol as 
described in Section 1.3 but use a different kind of dependency graph to determine the 
recovery lines. In their local system graph, when a message sent from checkpoint 
interval $(j, y)$ is processed in $(i, x)$, an edge is drawn from $c_{j,y+1}$ to 
$c_{i,x+1}$ as shown in Fig. 4.1(b), in contrast to the corresponding edge 
$(c_{j,y}, c_{i,x+1})$ in our checkpoint graph. Such a dependency definition can be viewed 
as extending Johnson and Zwaenepoel’s interval dependency for state intervals to 
checkpoint intervals in a non-PWD scenario, i.e., $(i, x)$ depends on $(j, y)$. A 
significant difference, though, is that such an extension results in possibly cyclic 
directed graphs. An alternative interpretation can be called the rollback dependency, 
i.e., the rollback of $c_{j,y+1}$ will cause the rollback of $c_{i,x+1}$. 

Figure 4.1: Rollback dependency and local system graphs. For the message in (a), (b) 
illustrates the rollback dependency edge. Local system graphs (c) and (d) 
are for determining the global and local recovery lines, respectively. 

The local system graph corresponding to the checkpoint graph in Fig. 1.4(b) is shown 
in Fig. 4.1(c). A virtual checkpoint is added at the end of the graph for every process 
to represent the current state. To determine the global recovery line, all of the virtual 
checkpoints are initially marked to simulate the situation in which all of the processes 
are rolled back. All of the checkpoints reachable from these initially marked virtual 
checkpoints are then searched and marked, and the latest unmarked checkpoint of each 
process forms the global recovery line. The local system graph corresponding to the 
extended checkpoint graph in Fig. 1.4(c) is shown in Fig. 4.1(d). Only the virtual 
checkpoint belonging to the failed process $p_1$ is initially marked. The local recovery 
line again consists of the latest unmarked checkpoint of each process. 
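
The marking procedure on a local system graph can be sketched as a reachability search. 
The Python fragment below is an illustration only, under assumptions not stated in the 
text: the function name is invented, each process is assumed to have at least an initial 
checkpoint with index 0, the virtual checkpoint of process p is given the index 
`ckpt_counts[p]`, and an edge from every checkpoint to its successor on the same process 
is added so that marking a checkpoint also marks all later checkpoints of that process. 
The two-message example is invented and does not correspond to Fig. 1.4. 

```python
def mark_and_find_recovery_line(ckpt_counts, messages, failed):
    """ckpt_counts[p]: number of real checkpoints of process p (at least 1).
    messages: list of ((j, y), (i, x)), sent from interval (j, y) and
              processed in interval (i, x).
    failed:   processes whose virtual checkpoints are initially marked."""
    successors = {}
    for p, n in enumerate(ckpt_counts):
        for x in range(n):                       # assumed same-process edges
            successors.setdefault((p, x), []).append((p, x + 1))
    for (j, y), (i, x) in messages:              # rollback dependency edges
        successors.setdefault((j, y + 1), []).append((i, x + 1))

    marked = set()
    stack = [(p, ckpt_counts[p]) for p in failed]
    while stack:
        node = stack.pop()
        if node not in marked:
            marked.add(node)
            stack.extend(successors.get(node, []))

    # Latest unmarked checkpoint of each process (index n is the virtual one).
    return [max(x for x in range(n + 1) if (p, x) not in marked)
            for p, n in enumerate(ckpt_counts)]


counts = [2, 3]                                  # p0: c_0..c_1, p1: c_0..c_2
messages = [((0, 1), (1, 1)), ((1, 0), (0, 0))]
global_line = mark_and_find_recovery_line(counts, messages, failed=[0, 1])
local_line_p1 = mark_and_find_recovery_line(counts, messages, failed=[1])
```

When only p1 is marked as failed, p0 keeps its current state (its virtual checkpoint) and 
p1 restarts from its latest real checkpoint, whereas marking both processes pushes p1 back 
by one additional checkpoint. 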
4.3 Non-fail-stop Failures and Software Error Recovery 

Much of the literature on checkpointing and rollback recovery is based on the as- 
sumption of fail-stop hardware failures. As described in Section 1.1, the only cause 
for rollback propagation under such an assumption is the potential nondeterminism. In 
practice, hardware failures can be non-fail-stop due to nontrivial error detection latencies; 
hence, the possibly corrupted messages constitute another source of rollback propagation. 
One way to build a non-fail-stop recovery protocol on top of a fail-stop recovery protocol 
is to exclude the potentially corrupted checkpoints of the failed processes from the check- 
point graphs. The rollback propagation algorithm (Fig. 1.5) can then guarantee that 
all of the possibly corrupted messages and the potentially contaminated checkpoints of 
the receiving processes do not affect the computation of the correct recovery line. If the 
maximum error detection latency, possibly different for various types of errors, is known 
in advance, we can simply exclude the checkpoints belonging to the maximum latency 
range [61]; otherwise, multiple retries can be performed by discarding more checkpoints 
when a previous retry fails. For applications in which the output commit is an impor- 
tant issue, only the checkpoints and message logs beyond the last output commit can be 
excluded [26]. 

Recently, checkpointing and recovery techniques have also been applied to the in- 
creasingly important area of software error recovery [42,62-70]. Unlike the recovery 
block approach [2] and N-version programming [71] which both use different programs 
to execute on the same set of data, the on-line retry approach based on checkpointing 
and rollback [65, 66] uses the same program to operate on a different but consistent set 
of data [72, 73] obtained through the inherent nondeterminism, and has been shown to 
be effective in bypassing software errors to improve system availability [67]. 

We have proposed a progressive retry technique [42] based on the log-based approach 
for software error recovery. It is based on the observation that, in many long-life software 
systems, software errors can be recovered by “localized” retries without affecting the 
other parts of the systems. Therefore, the scope of rollback should be controlled by 
progressively discarding more and more message log information as a previous retry fails. 

We will refer to Fig. 4.2 for the following discussion. First, the failed process $p_2$ 
is restarted from a previous checkpoint and replays the logged messages in their original 
order to reconstruct the process state before the failure. This Step-1 retry will succeed 
if the error was caused by some transient problems such as concurrency conflicts that may 
simply disappear after the rollback. When Step-1 retry still leads to an error, the failed 
process starts a second attempt by reordering the messages; for example, $p_2$ in 
Fig. 4.2(b) can reorder $M_a$ and $M_b$. (We note that $M_c$ becomes an orphan message 
with respect to the recovery line and hence cannot be used in the reordering.) This Step-2 
retry can be useful when the error was due to some untested boundary conditions [63] and 
the reordering can bypass that condition. In some cases, the software errors are triggered 
by messages suffering from unexpected transmission delay in the communication channels 
(message $M_d$ in Fig. 4.2(c)); Step-3 retry thus forces the sender to resend the messages 
to obtain a “normal” interleaving of messages. 

Figure 4.2: Progressive retry. (a) Step 1: message replay; (b) Step 2: message reordering; 
(c) Step 3: message resending; (d) Step 4: message revocation. (Shaded logical 
checkpoints indicate the recovery lines; circled physical checkpoints indicate 
the restarting checkpoints for rolled-back processes; in-transit messages are 
drawn in dashed lines; orphan messages are drawn in dotted lines.) 

If Step-3 still fails, it implies that the above messages may have been corrupted in the 
first place; Step-4 retry then further rolls back the senders in order to revoke the 
possibly corrupted messages ($M_a$ and $M_b$ in Fig. 4.2(d)). When all previous retries 
have failed, Step-5 retry rolls back the entire system to a previous consistent global 
checkpoint as a final attempt. The progressive retry technique has been used in an AT&T 
telecommunication billing system and a replicated file system at Bell Laboratories as an 
economical way of recovering from certain software errors [42,74]. 
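
The escalation from Step 1 to Step 5 can be summarized as a loop that tries each retry in 
order of increasing rollback scope and stops at the first success. The Python fragment 
below is only a schematic sketch: the function name, the step labels, and the boolean 
stand-ins for the retry outcomes are hypothetical, and a real system would replace each 
stand-in with the corresponding replay, reordering, resending, revocation, or global 
rollback action. 

```python
def progressive_retry(steps):
    """Try each (name, attempt) pair in order; `attempt` is a callable that
    returns True when the retry succeeds. Return (step number, name) of the
    first successful step, or None if every step fails."""
    for number, (name, attempt) in enumerate(steps, start=1):
        if attempt():
            return number, name
    return None


# Hypothetical outcomes: replay and reordering fail, resending succeeds.
outcomes = {
    "message replay": False,
    "message reordering": False,
    "message resending": True,
    "message revocation": True,
    "global rollback": True,
}
steps = [(name, (lambda ok=ok: ok)) for name, ok in outcomes.items()]
result = progressive_retry(steps)
```

Because the loop stops at the first success, the later and more disruptive steps are never 
attempted in this example, which reflects the goal of keeping the scope of rollback as 
small as possible. 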
5. CONCLUSIONS 


5.1 Summary 

This thesis has derived a necessary and sufficient condition for identifying all garbage 
checkpoints and message logs in an uncoordinated checkpointing protocol. We proved 
that there exists a set of N recovery lines such that any checkpoint useful for a possible 
future recovery must be contained in one of the N recovery lines, and any useful message 
must be an in-transit message with respect to one of the same N recovery lines. An 
optimal garbage collection algorithm of time complexity $O(N|E|)$, where N is the number 
of processes and $|E|$ is the number of edges in the checkpoint graph, has been presented 
to identify all nongarbage checkpoints and message logs; the storage space of the 
remaining checkpoints and message logs can then be reclaimed. In addition, we have 
demonstrated that the lowest upper bound on the number of nongarbage checkpoints is 
$N(N + 1)/2$, as opposed to the common perception that an uncoordinated checkpointing 
protocol has to maintain a potentially unbounded number of useful checkpoints. 

A unifying framework has also been proposed to integrate the three traditionally sep- 
arated approaches into one flexible checkpointing and recovery scheme. The framework is 
based on uncoordinated checkpointing to allow maximum process autonomy and general 
nondeterministic executions, employs lazy checkpoint coordination to control the coordi- 
nation frequency and to eliminate the domino effect, and exploits piecewise determinism 
whenever possible to further advance the recovery line. It was then demonstrated that 
the optimal garbage collection algorithm can be applied to such a domino-free recovery 
protocol to minimize the space overhead. 

5.2 Limitations and Future Research 

The optimal garbage collection algorithm developed in this thesis is a centralized 
algorithm based on global dependency information obtained through direct dependency 
tracking. As demonstrated in Appendix A, in spite of the possible missing happened 
before relations, direct dependency tracking maintains sufficient information for deter- 
mining consistent global checkpoints. A potential research topic is to study the trade-off 
between the cost of dependency tracking and the degree of algorithm decentralization. 
On the one hand, a new dependency tracking mechanism may be devised to record the 
minimum information sufficient for recovery line computation. Such a scheme would 
require the collection of global information. On the other hand, transitive dependency 
tracking [26, 33] or antecedence graph tracking [40] may allow decentralized garbage 
collection based on partial dependency information at the cost of more complicated tracking 
mechanisms. 

A more aggressive approach to reducing space overhead would be to avoid garbage 
checkpoints in the first place. It is not possible to avoid taking a garbage checkpoint 
because any new checkpoint must be a maximal element in the poset at the time it is 
taken and hence must be a nongarbage checkpoint according to the algorithm. Such a 
checkpoint can become a garbage checkpoint through dependency relations with future 
checkpoints. Hence, it is possible to insert additional checkpoints based on dependency 
tracking to avoid generating garbage checkpoints. Xu and Netzer [75] have proposed 
an adaptive checkpointing scheme based on the notion of zigzag paths (as described in 
Appendix A). An extra checkpoint is inserted whenever a backward zigzag path is about 
to be formed. Since the zigzag paths are, in general, not on-line trackable, causal path 
tracking is used as an approximation to allow the decision to be made based on local 
information. For systems allowing message reordering, the message scheduling algorithm 
as described in Section 3.3 can be combined with the above scheme to reduce the number 
of extra checkpoints. Alternatively, a checkpoint coordinator which possesses global 
information may give advice to other processes as to when to take appropriate checkpoints 
in order not to generate garbage checkpoints. Such an approach can also contribute to 
advancing the recovery line. 
APPENDIX A. MISSING DEPENDENCY IN DIRECT DEPENDENCY TRACKING 


Recall that the checkpoint graphs based on direct dependency tracking as described 
in Section 1.3 record the dependency information, denoted by $<_d$, in the following form: 
$c_{j,y} <_d c_{i,x+1}$ if and only if 

1. $i = j$ and $x = y$; or 

2. $i \ne j$ and there is a message m sent from $(j, y)$ and received in $(i, x)$. 

We will denote by $<_c$ the transitive closure of $<_d$. 

Figure A.1(d) shows the checkpoint graph corresponding to the checkpoint and 
communication pattern $\mathcal{P}$ in Fig. A.1(a). A close look at Figs. A.1(a) and (d) 
reveals that the checkpoint graph does not capture the complete happened before relation 
among all checkpoints. For example, while $c_{k,z} < c_{i,x+1}$ is clearly visible in 
Fig. A.1(a), the corresponding relationship $c_{k,z} <_c c_{i,x+1}$ is absent from 
Fig. A.1(d). In other words, the poset $R^c_{\mathcal{P}} = (C_{\mathcal{P}}, <_c)$ 
constructed through direct dependency tracking is not equal to the poset $R_{\mathcal{P}}$ 
from which we built our model of consistent global checkpoints. This sort of dependency 
information is lost precisely because of the lack of transitive dependency information 
propagation in the direct dependency tracking mechanism. 

We will demonstrate that, despite the missing dependencies, $R^c_{\mathcal{P}}$ has 
exactly the same set of maximum-sized antichains as does $R_{\mathcal{P}}$, and therefore 
that the checkpoint graph suffices to enable the determination of consistent global 
checkpoints, and of recovery lines in particular. Intuitively, while the inconsistency 
between $c_{k,z}$ and $c_{i,x+1}$ caused by the happened before relation 
$c_{k,z} < c_{i,x+1}$ is missing from Fig. A.1(d), the global inconsistency in the sense 
that $c_{k,z}$ and $c_{i,x+1}$ cannot co-exist in any consistent global checkpoint is 
implied through the “zigzag” from $c_{k,z}$ to $c_{j,y+1}$ to $c_{j,y}$ to $c_{i,x+1}$. 

Figure A.1: Three different checkpoint and communication patterns with the same 
checkpoint graph. The ordering $c_{k,z} < c_{i,x+1}$ implied by (a) and (b) is missing 
from (d). 

Xu and Netzer [75, 76] introduce the notion of zigzag paths as follows: a zigzag path 
exists from $c_{j,y}$ to $c_{i,x}$ if and only if there exist messages 
$m_1, m_2, \ldots, m_n$ ($n \ge 1$) such that 

1. $m_1$ is sent by $p_j$ after $c_{j,y}$; 

2. if $m_l$ ($1 \le l < n$) is processed by $p_k$ in $(k, z)$, then $m_{l+1}$ is sent from 
$(k, z)$ or a later checkpoint interval (note that $m_{l+1}$ may be sent before or after 
$m_l$ is processed); and 

3. $m_n$ is processed by $p_i$ before $c_{i,x}$. 

Figure A.2(a) gives an example of a zigzag path from $c_{j,y}$ to $c_{i,x}$. Figure A.2(b) 
shows a causal path from $c_{j,y}$ to $c_{i,x}$, which is a special case of the zigzag path 
and results in the happened before relation between the two checkpoints. It becomes clear 
that the notion of zigzag paths is a generalization of Lamport’s happened before relation 
to address the “global consistency” issue. In particular, it has been shown that the 
existence of any zigzag path between two checkpoints excludes the possibility of their 
belonging to the same consistent global checkpoint, as stated in the following 
property [75]. 

PROPERTY 13 If there exists a zigzag path from $c_{j,y}$ to $c_{i,x}$, then $c_{j,y}$ and 
$c_{i,x}$ cannot belong to the same consistent global checkpoint. 
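
The three conditions of the zigzag path definition translate into a search over messages. 
The Python fragment below is a minimal sketch, not the method of Xu and Netzer: the 
function name, the encoding of a message as the tuple (sender, send interval, receiver, 
receive interval), and the three-process example are assumptions, and checkpoint interval 
$(j, y)$ is taken to be the interval that starts at $c_{j,y}$. 

```python
def zigzag_path_exists(messages, src, dst):
    """messages: iterable of (sender, send_interval, receiver, recv_interval).
    src = (j, y) and dst = (i, x) identify checkpoints c_{j,y} and c_{i,x}."""
    (j, y), (i, x) = src, dst
    messages = list(messages)
    # Condition 1: the first message is sent by p_j after c_{j,y}.
    frontier = [m for m in messages if m[0] == j and m[1] >= y]
    seen = set(frontier)
    while frontier:
        _, _, k, z = frontier.pop()
        # Condition 3: the last message is processed by p_i before c_{i,x}.
        if k == i and z < x:
            return True
        # Condition 2: the next message is sent by p_k from (k, z) or later,
        # regardless of whether it is sent before or after the receive event.
        for nxt in messages:
            if nxt not in seen and nxt[0] == k and nxt[1] >= z:
                seen.add(nxt)
                frontier.append(nxt)
    return False


# Invented example in the spirit of Fig. A.1(a): p0 sends to p1 in interval
# (1, 0), and p1 sends to p2 from the same interval (1, 0).
messages = [(0, 0, 1, 0), (1, 0, 2, 0)]
has_zigzag = zigzag_path_exists(messages, src=(0, 0), dst=(2, 1))
```

Since the test uses only checkpoint interval numbers and ignores the order of events 
inside an interval, it detects the zigzag path even when the second message is sent 
before the first one is received, which is the case that the happened before relation 
alone does not capture. 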

It is not hard to see that, if all of the send events precede all of the receive events in 
the same checkpoint interval as shown in Fig. A.1(c), then the poset $R^c_{\mathcal{P}}$ 
corresponding to the transitive closure of the checkpoint graph is exactly the poset 
$R_{\mathcal{P}}$ corresponding to the checkpoint and communication pattern. Given any 
checkpoint and communication pattern $\mathcal{P}$, our approach is to transform 
$\mathcal{P}$ into another pattern $\mathcal{P}^*$ with the above property, without 
affecting the set of consistent global checkpoints. Denote by $r^*_{j,y}$ the first receive 
event in $(j, y)$ if there is one, or the checkpoint event $c_{j,y+1}$ otherwise. 
Figure A.2: (a) Zigzag path and (b) causal path. 

Let m be a message in $\mathcal{P}$ sent after $r^*_{j,y}$ from $(j, y)$ and processed in 
$(i, x)$, as shown in Fig. A.1(a). We will denote the send and receive events for this 
message by $s^m_{j,y}$ and $r^m_{i,x}$, respectively. The transformation on $\mathcal{P}$ 
is defined as follows: for each such message m, we first add a message m' with 
$s^{m'}_{j,y} < r^*_{j,y}$ and $r^{m'}_{i,x} = r^m_{i,x}$ (as shown in Fig. A.1(b)), and 
then remove the message m. We are now prepared to show that the missing dependencies 
do not affect the determination of consistent global checkpoints. 


PROPERTY 14 For any checkpoint and communication pattern $\mathcal{P}$, 
$M(R_{\mathcal{P}}) = M(R^c_{\mathcal{P}})$. Hence, the poset corresponding to the 
transitive closure of the checkpoint graph for $\mathcal{P}$ suffices for the determination 
of consistent global checkpoints of $\mathcal{P}$. 

Proof. It suffices to prove that $M(R_{\mathcal{P}}) = M(R_{\mathcal{P}^*})$ because 
$R_{\mathcal{P}^*} = R^c_{\mathcal{P}^*} = R^c_{\mathcal{P}}$. We consider any global 
checkpoint M in $R_{\mathcal{P}}$. If $M \in M(R_{\mathcal{P}})$, then $c_1 \not< c_2$ in 
$Q_{\mathcal{P}}$ for any $c_1, c_2 \in M$. Since adding any message m' in the 
transformation can introduce only the relation $s^{m'}_{j,y} < r^{m'}_{i,x}$ that is 
already implied in $\mathcal{P}$ through $s^{m'}_{j,y} < s^m_{j,y} < r^m_{i,x}$, and 
removing any message m cannot make any originally unordered pair become ordered, we must 
have $c_1 \not< c_2$ in $Q_{\mathcal{P}^*}$ as well and thus $M \in M(R_{\mathcal{P}^*})$. 

If $M \notin M(R_{\mathcal{P}})$, there must exist $c_1, c_2 \in M$ such that $c_1 < c_2$ 
in $Q_{\mathcal{P}}$. Since every message m' in the transformation is sent in the same 
checkpoint interval as is its corresponding message m, every zigzag path in $\mathcal{P}$ 
remains a zigzag path in $\mathcal{P}^*$. That $c_1 < c_2$ in $Q_{\mathcal{P}}$ implies 
there exists a causal path, and hence a zigzag path, from $c_1$ to $c_2$. Since that zigzag 
path must still exist in $\mathcal{P}^*$, we have $M \notin M(R_{\mathcal{P}^*})$ by 
Property 13. Therefore, we have shown that $M(R_{\mathcal{P}}) = M(R_{\mathcal{P}^*})$ and 
thus $M(R_{\mathcal{P}}) = M(R^c_{\mathcal{P}})$ as required. □ 